Statistical Inference Through Data Compression
Rudi Cilibrasi
Statistical Inference Through Data Compression

ILLC Dissertation Series DS-2006-08
For further information about ILLC-publications, please contact
Institute for Logic, Language and Computation
Universiteit van Amsterdam
Plantage Muidergracht 24
1018 TV Amsterdam
phone: +31-20-525 6051
fax: +31-20-525 5206
e-mail: [email protected]
homepage: http://www.illc.uva.nl/
ACADEMISCH PROEFSCHRIFT (Academic Dissertation)
for the degree of Doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus, prof. mr. P.F. van der Heijden, before a committee appointed by the Doctorate Board, to be defended in public in the Aula of the University on Thursday, 7 September 2006, at 12:00

by
Rudi Cilibrasi
born in Brooklyn, NY, USA.

Promotion committee:
Promotor: Prof.dr.ir. P.M.B. Vitányi
Co-promotor: Dr. P.D. Grünwald
Other members: Prof.dr. P. Adriaans, Prof.dr. R. Dijkgraaf, Prof.dr. M. Li, Prof.dr. B. Ryabko, Prof.dr. A. Siebes, Dr. L. Torenvliet
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Copyright © 2006 by Rudi Cilibrasi
Printed and bound by PRINTPARTNERS IPSKAMP.
ISBN-10: 90-5776-155-7
ISBN-13: 978-90-5776-155-8

My arguments will be open to all, and may be judged of by all. – Publius
Contents
1 Introduction
  1.1 Overview of this thesis
    1.1.1 Data Compression as Learning
    1.1.2 Visualization
    1.1.3 Learning From the Web
    1.1.4 Clustering and Classification
  1.2 Gestalt Historical Context
  1.3 Contents of this Thesis

2 Technical Introduction
  2.1 Finite and Infinite
  2.2 Strings and Languages
  2.3 The Many Facets of Strings
  2.4 Prefix Codes
    2.4.1 Prefix Codes and the Kraft Inequality
    2.4.2 Uniquely Decodable Codes
    2.4.3 Probability Distributions and Complete Prefix Codes
  2.5 Turing Machines
  2.6 Kolmogorov Complexity
    2.6.1 Conditional Kolmogorov Complexity
    2.6.2 Kolmogorov Randomness and Compressibility
    2.6.3 Universality In K
    2.6.4 Sophisticated Forms of K
  2.7 Classical Probability Compared to K
  2.8 Uncomputability of Kolmogorov Complexity
  2.9 Conclusion

3 Normalized Compression Distance (NCD)
  3.1 Similarity Metric
  3.2 Normal Compressor
  3.3 Background in Kolmogorov complexity
  3.4 Compression Distance
  3.5 Normalized Compression Distance
  3.6 Kullback-Leibler divergence and NCD
    3.6.1 Static Encoders and Entropy
    3.6.2 NCD and KL-divergence
  3.7 Conclusion

4 A New Quartet Tree Heuristic For Hierarchical Clustering
  4.1 Summary
  4.2 Introduction
  4.3 Hierarchical Clustering
  4.4 The Quartet Method
  4.5 Minimum Quartet Tree Cost
    4.5.1 Computational Hardness
  4.6 New Heuristic
    4.6.1 Algorithm
    4.6.2 Performance
    4.6.3 Termination Condition
    4.6.4 Tree Building Statistics
    4.6.5 Controlled Experiments
  4.7 Quartet Topology Costs Based On Distance Matrix
    4.7.1 Distance Measure Used
    4.7.2 CompLearn Toolkit
    4.7.3 Testing The Quartet-Based Tree Construction
  4.8 Testing On Artificial Data
  4.9 Testing On Heterogeneous Natural Data
  4.10 Testing on Natural Data
    4.10.1 Analyzing the SARS and H5N1 Virus Genomes
    4.10.2 Music
    4.10.3 Mammalian Evolution
  4.11 Hierarchical versus Flat Clustering

5 Classification systems using NCD
  5.1 Basic Classification
    5.1.1 Binary and Multiclass Classifiers
    5.1.2 Naive NCD Classification
  5.2 NCD With Trainable Classifiers
    5.2.1 Choosing Anchors
  5.3 Trainable Learners of Note
    5.3.1 Neural Networks
    5.3.2 Support Vector Machines
    5.3.3 SVM Theory
    5.3.4 SVM Parameter Setting

6 Experiments with NCD
  6.1 Similarity
  6.2 Experimental Validation
  6.3 Truly Feature-Free: The Case of Heterogeneous Data
  6.4 Music Categorization
    6.4.1 Details of Our Implementation
    6.4.2 Genres: Rock vs. Jazz vs. Classical
    6.4.3 Classical Piano Music (Small Set)
    6.4.4 Classical Piano Music (Medium Set)
    6.4.5 Classical Piano Music (Large Set)
    6.4.6 Clustering Symphonies
    6.4.7 Future Music Work and Conclusions
    6.4.8 Details of the Music Pieces Used
  6.5 Genomics and Phylogeny
    6.5.1 Mammalian Evolution
    6.5.2 SARS Virus
    6.5.3 Analysis of Mitochondrial Genomes of Fungi
  6.6 Language Trees
  6.7 Literature
  6.8 Optical Character Recognition
  6.9 Astronomy
  6.10 Conclusion

7 Automatic Meaning Discovery Using Google
  7.1 Introduction
    7.1.1 Googling For Knowledge
    7.1.2 Related Work and Background NGD
    7.1.3 Outline
  7.2 Feature-Free Extraction of Semantic Relations with Google
    7.2.1 Genesis of the Approach
    7.2.2 The Google Distribution
    7.2.3 Universality of Google Distribution
    7.2.4 Universality of Normalized Google Distance
  7.3 Introduction to Experiments
    7.3.1 Google Frequencies and Meaning
    7.3.2 Some Implementation Details
    7.3.3 Three Applications of the Google Method
  7.4 Hierarchical Clustering
    7.4.1 Colors and Numbers
    7.4.2 Dutch 17th Century Painters
    7.4.3 Chinese Names
  7.5 SVM Learning
    7.5.1 Emergencies
    7.5.2 Learning Prime Numbers
    7.5.3 WordNet Semantics: Specific Examples
    7.5.4 WordNet Semantics: Statistics
  7.6 Matching the Meaning
  7.7 Conclusion

8 Stemmatology
  8.1 Introduction
  8.2 A Minimum-Information Criterion
  8.3 An Algorithm for Constructing Stemmata
  8.4 Results and Discussion
  8.5 Conclusions

9 Comparison of CompLearn with PHYLIP

10 CompLearn Documentation

Bibliography

Index

11 Nederlands Samenvatting

12 Biography
List of Figures
1.1 The evolutionary tree built from complete mammalian mtDNA sequences of 24 species. See Chapter 4, Section 4.10.3 for details.
1.2 Several people’s names, political parties, regions, and other Chinese names. See Chapter 7, Section 7.4.3 for more details.
1.3 102 Nobel prize winning writers using CompLearn and NGD. See Chapter 9 for details.

3.1 A comparison of predicted and observed values for NCDR.

4.1 The three possible quartet topologies for the set of leaf labels u,v,w,x.
4.2 An example tree consistent with quartet topology uv|wx.
4.3 Progress of a 60-item data set experiment over time.
4.4 Histogram of run-time number of trees examined before termination.
4.5 Histogram comparing distributions of k-mutations per run.
4.6 The randomly generated tree that our algorithm reconstructed. S(T) = 1.
4.7 Classification of artificial files with repeated 1-kilobyte tags. Not all possibilities are included; for example, file “b” is missing. S(T) = 0.905.
4.8 Classification of different file types. Tree agrees exceptionally well with NCD distance matrix: S(T) = 0.984.
4.9 SARS virus among other viruses. Legend: AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2. S(T) = 0.988.
4.10 One hundred H5N1 (bird flu) sample genomes, S(T) = 0.980221.
4.11 Output for the 12-piece set. Legend: J.S. Bach [Wohltemperierte Klavier II: Preludes and Fugues 1,2 — BachWTK2{F,P}{1,2}]; Chopin [Préludes op. 28: 1, 15, 22, 24 — ChopPrel{1,15,22,24}]; Debussy [Suite Bergamasque, 4 movements — DebusBerg{1,2,3,4}]. S(T) = 0.968.
4.12 The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Figure 4.14. We have redrawn the tree from our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.996.
4.13 Multidimensional clustering of the same NCD matrix (Figure 4.14) as used for Figure 4.12. Kruskal’s stress-1 = 0.389.
4.14 Distance matrix of pairwise NCD. For display purposes, we have truncated the original entries from 15 decimals to 3 decimals precision.

6.1 Classification of different file types. Tree agrees exceptionally well with NCD distance matrix: S(T) = 0.984.
6.2 Output for the 36 pieces from 3 genres.
6.3 Output for the 12-piece set.
6.4 Output for the 32-piece set.
6.5 Output for the 60-piece set.
6.6 Output for the set of 34 movements of symphonies.
6.7 The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Figure 4.14, where it was used to illustrate a point of hierarchical clustering versus flat clustering. We have redrawn the tree from our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.996.
6.8 Dendrogram of mitochondrial genomes of fungi using NCD. This represents the distance matrix precisely with S(T) = 0.999.
6.9 Dendrogram of mitochondrial genomes of fungi using block frequencies. This represents the distance matrix precisely with S(T) = 0.999.
6.10 Clustering of Native-American, Native-African, and Native-European languages. S(T) = 0.928.
6.11 Clustering of Russian writers. Legend: I.S. Turgenev, 1818–1883 [Fathers and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky, 1821–1881 [Crime and Punishment, The Gambler, The Idiot, Poor Folk]; L.N. Tolstoy, 1828–1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol, 1809–1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov, 1891–1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]. S(T) = 0.949.
6.12 Clustering of Russian writers translated into English. The translator is given in brackets after the titles of the texts. Legend: I.S. Turgenev, 1818–1883 [Fathers and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky, 1821–1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin), Poor Folk (C.J. Hogarth)]; L.N. Tolstoy, 1828–1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol, 1809–1852 [Dead Souls (C.J. Hogarth), Taras Bulba (≈ G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled (≈ I.F. Hapgood)]; M. Bulgakov, 1891–1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]. S(T) = 0.953.
6.13 Images of handwritten digits used for OCR.
6.14 Clustering of the OCR images. S(T) = 0.901.
6.15 16 observation intervals of GRS 1915+105 from four classes. The initial capital letter indicates the class corresponding to Greek lower case letters in [9]. The remaining letters and digits identify the particular observation interval in terms of finer features and identity. The T-cluster is top left, the P-cluster is bottom left, the G-cluster is to the right, and the D-cluster in the middle. This tree almost exactly represents the underlying NCD distance matrix: S(T) = 0.994.

7.1 European Commission members clustered using NGD, S(T) = 0.983228.
7.2 Numbers versus log probability (pagecount / M) in a variety of languages and formats.
7.3 Colors and numbers arranged into a tree using NGD.
7.4 Fifteen paintings by three different painters arranged into a hierarchical clustering tree. In the experiment, only painting title names were used; the painter prefix shown in the diagram was added afterwards as annotation to assist in interpretation. The painters and paintings used follow. Rembrandt van Rijn: Hendrickje slapend; Portrait of Maria Trip; Portrait of Johannes Wtenbogaert; The Stone Bridge; The Prophetess Anna. Jan Steen: Leiden Baker Arend Oostwaert; Keyzerswaert; Two Men Playing Backgammon; Woman at her Toilet; Prince’s Day; The Merry Family. Ferdinand Bol: Maria Rey; Consul Titus Manlius Torquatus; Swartenhont; Venus and Adonis.
7.5 Several people’s names, political parties, regions, and other Chinese names.
7.6 English Translation of Chinese Names.
7.7 Google-SVM learning of “emergencies.”
7.8 Google-SVM learning of primes.
7.9 Google-SVM learning of “electrical” terms.
7.10 Google-SVM learning of “religious” terms.
7.11 Histogram of accuracies over 100 trials of WordNet experiment.
7.12 English-Spanish Translation Problem.
7.13 Translation Using NGD.

8.1 An excerpt of a 15th century manuscript ‘H’ from the collections of the Helsinki University Library, showing the beginning of the legend of St. Henry on the right: “Incipit legenda de sancto Henrico pontifice et martyre; lectio prima; Regnante illustrissimo rege sancto Erico, in Suecia, uenerabilis pontifex beatus Henricus, de Anglia oriundus, ...” [50].
8.2 An example tree obtained with the compression-based method. Changes are circled and labeled with numbers 1–5. Costs of changes are listed in the box. Best reconstructions at interior nodes are written at the branching points.
8.3 Best tree found. Most probable place of origin according to [50], see Table 8.5, indicated by color — Finland (blue): K,Ho,I,T,A,R,S,H,N,Fg; Vadstena (red): AJ,D,E,LT,MN,Y,JB,NR2,Li,F,G; Central Europe (yellow): JG,B; other (green). Some groups supported by earlier work are circled in red.
8.4 Consensus tree. The numbers on the edges indicate the number of bootstrap trees out of 100 where the edge separates the two sets of variants. Large numbers suggest high confidence in the identified subgroup. Some groups supported by earlier work are circled in red.
8.5 CompLearn tree showing many similarities with the tree in Fig. 8.3.

9.1 Using the kitsch program in PHYLIP for comparison of H5N1 tree.
9.2 102 Nobel prize winning writers using CompLearn and NGD; S(T) = 0.905630 (part 1).
9.3 102 Nobel prize winning writers using CompLearn and NGD; S(T) = 0.905630 (part 2).
9.4 102 Nobel prize winning writers using CompLearn and NGD; S(T) = 0.905630 (part 3).
9.5 102 Nobel prize winning writers using the PHYLIP kitsch.
List of Tables
6.1 The 60 classical pieces used (‘m’ indicates presence in the medium set, ‘s’ in the small and medium sets).
6.2 The 12 jazz pieces used.
6.3 The 12 rock pieces used.
Acknowledgements
The author would like to thank first and foremost Dr. Paul Vitányi for his elaborate feedback and tremendous technical contributions to this work. Next I thank Dr. Peter Grünwald for ample feedback. I also thank my colleagues John Tromp and Ronald de Wolf. I thank my friends Dr. Kaihsu Tai and Ms. Anna Lissa Cruz for extensive feedback and experimental inputs. This thesis is dedicated to my grandparents, Edwin and Dorothy, for their equanimity and support. It is further dedicated in spirit to the memories of my mother, Theresa, for her compassion, and to my father, Salvatore, for his deep insight and foresight. This work was supported in part by the Netherlands BSIK/BRICKS project, by NWO project 612.55.002, and by the IST Programme of the European Community under the PASCAL Network of Excellence, IST-2002-506778.
Papers on Which the Thesis is Based
Chapter 2 is introductory material, mostly based on M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, second edition, 1997.

Chapter 3 is based on R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005, as well as M. Li, X. Chen, X. Li, B. Ma and P. Vitányi. The similarity metric. IEEE Trans. Information Theory, 50(12):3250–3264, 2004. Section 3.6 is based on unpublished work by R. Cilibrasi.

Chapter 4 is based on R. Cilibrasi and P.M.B. Vitányi. A new quartet tree heuristic for hierarchical clustering, IEEE/ACM Trans. Comput. Biol. Bioinf., submitted. Presented at the EU-PASCAL Statistics and Optimization of Clustering Workshop, London, 2005, http://arxiv.org/abs/cs.DS/0606048
Chapter 5 is based on unpublished work by R. Cilibrasi.

Chapter 6 is based on R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005; R. Cilibrasi, P.M.B. Vitányi, and R. de Wolf. Algorithmic clustering of music based on string compression. Computer Music Journal, pages 49–67. A preliminary version appeared as R. Cilibrasi, R. de Wolf, P. Vitányi, Algorithmic clustering of music, Proc. IEEE 4th International Conference on Web Delivering of Music (WEDELMUSIC 2004), IEEE Comp. Soc. Press, 2004, 110–117.

This work was reported in, among others: “Software to unzip identity of unknown composers,” New Scientist, 12 April 2003, by Hazel Muir; “Software sorts tunes,” Technology Research News, April 23/30, 2003, by Kimberly Patch; and “Classer musiques, langues, images, textes et genomes,” Pour La Science, 317 (March 2004), 98–103, by Jean-Paul Delahaye (Pour la Science = édition française de Scientific American).
Chapter 7 is based on
R. Cilibrasi, P.M.B. Vitányi, Automatic meaning discovery using Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004); followed by conference versions
R. Cilibrasi and P.M.B. Vitányi, Automatic Extraction of Meaning from the Web, 2006 IEEE International Symposium on Information Theory (ISIT 2006), Seattle, 2006; and
R. Cilibrasi, P.M.B. Vitányi, Similarity of objects and the meaning of words, Proc. 3rd Conf. Theory and Applications of Models of Computation (TAMC), 15-20 May, 2006, Beijing, China. Lecture Notes in Computer Science, Vol. 3959, Jin-Yi Cai, S. Barry Cooper, and Angsheng Li (Eds.), 2006; and the journal version
R. Cilibrasi and P.M.B. Vitányi. The Google similarity distance, IEEE Transactions on Knowledge and Data Engineering, To appear.
The supporting experimental data for the binary classification experimental comparison with WordNet can be found at http://www.cwi.nl/~cilibrar/googlepaper/appendix.pdf
This work was reported in, among others, “A search for meaning,” New Scientist, 29 January 2005, p. 21, by Duncan Graham-Rowe; on the Web in “Deriving semantic meaning from Google results,” Slashdot — News for nerds, Stuff that matters, Discussion in the Science section, 29 January, 2005.
Chapter 8 is based on T. Roos, T. Heikkilä, R. Cilibrasi, and P. Myllymäki. Compression-based stemmatology: A study of the legend of St. Henry of Finland, 2005. HIIT technical report, http://cosco.hiit.fi/Articles/hiit-2005-3.pdf
Chapter 9 is based on unpublished work by R. Cilibrasi.
Chapter 10 describes the CompLearn system, a general software tool to apply the ideas in this Thesis, written by R. Cilibrasi, and explains the reasoning behind it; see http://complearn.org/ for more information.
Chapter 1 Introduction
But certainly for the present age, which prefers the sign to the thing signified, the copy to the original, representation to reality, the appearance to the essence... illusion only is sacred, truth profane. Nay, sacredness is held to be enhanced in proportion as truth decreases and illusion increases, so that the highest degree of illusion comes to be the highest degree of sacredness. –Feuerbach, Preface to the second edition of The Essence of Christianity
1.1 Overview of this thesis
This thesis concerns a remarkable new scientific development that advances the state of the art in the field of data mining, or searching for previously unknown but meaningful patterns in fully or semi-automatic ways. A substantial amount of mathematical theory is presented as well as very many (though not yet enough) experiments. The results serve to test, verify, and demonstrate the power of this new technology. The core ideas of this thesis relate substantially to data compression programs. For more than 30 years, data compression software has been developed and significantly improved with better models for almost every type of file. Until recently, the main driving interests in such technology were to economize on disk storage or network data transmission costs. A new way of looking at data compressors and machine learning allows us to use compression programs for a wide variety of problems.

In this thesis a few themes are important. The first is the use of data compressors in new ways. The second is a new tree visualization technique. And the third is an information-theoretic connection of a web search engine to the data mining system. Let us examine each of these in turn.
1.1.1 Data Compression as Learning

The first theme concerns the statistical significance of compressed file sizes. Most computer users realize that there are freely available programs that can compress text files to about one quarter their original size. The less well known aspect of data compression is that combining
two or more files together to create a larger single conglomerate archive file prior to compression often yields better compression in aggregate. This has been used to great advantage in widely popular programs like tar or pkzip, combining archival and compression functionality. Only in recent years have scientists begun to appreciate the fact that compression ratios signify a great deal of important statistical information.

[Figure 1.1: The evolutionary tree built from complete mammalian mtDNA sequences of 24 species. The leaf labels group into Marsupials (platypus, opossum, wallaroo), Rodents (mouse, rat), Ferungulates (horse, whiterhino, cat, graySeal, harborSeal, blueWhale, finWhale), and Primates (gibbon, orangutan, gorilla, human, chimpanzee, pigmyChimpanzee). See Chapter 4, Section 4.10.3 for details.]

All of the experiments in this thesis make use of a group of compressible objects. In each case, the individual compressed sizes of each object are calculated. Then, some or all possible pairs of objects are combined and compressed to yield pairwise compressed sizes. It is the tiny variations in the pairwise compressed sizes that yield the surprisingly powerful results of the following experiments. The key concept to realize is that if two files are very similar in their contents, then they will compress much better when combined together prior to compression, as compared to the sum of the sizes of the separately compressed files. If two files have little or nothing in common, then combining them together yields no benefit over compressing each file separately. Although the principle is intuitive to grasp, it has surprising breadth of applicability. By using even the simplest string-matching type of compression developed in the 1970s it is possible to construct evolutionary trees for animals fully automatically using files containing their mitochondrial gene sequences. One example is shown in Figure 1.1. We first construct a matrix of pairwise distances between objects (files) that indicates how similar they are. These distances are based on comparing compressed file sizes as described above. We can apply this to files of widely different types, such as music pieces or genetic codes, as well as many other specialized domains. In Figure 1.1, we see a tree constructed from the similarity distance matrix based on the mitochondrial DNA of several species. The tree is constructed so that species with “similar” DNA are “close by” in the tree. In this way we may lend support to certain evolutionary theories.

Although simple compressors work, it is also easy to use the most advanced modern compressors with the theory presented in this thesis; these can often be more accurate than simpler compressors in a variety of particular circumstances or domains. The main advantage of this approach is its robustness in the face of strange or erroneous data. Another key advantage is its simplicity and ease of use. This comes from the generality of the method: it works in a variety of different application domains, and when using general-purpose compressors it becomes a general-purpose inference engine.

[Figure 1.2: Several people’s names, political parties, regions, and other Chinese names. See Chapter 7, Section 7.4.3 for more details.]

Throughout this thesis there is a focus on coding theory and data compression, both as a theoretical construct and as practical approximations thereof through actual data compression programs in current use. There is a connection between a particular code and a probability distribution, and this simple theoretical foundation allows one to use data compression programs of all types as statistical inference engines of remarkable robustness and generality. In Chapter 3, we describe the Normalized Compression Distance (NCD), which formalizes the ideas we have just described. We report on a plethora of experiments in Chapter 6 showing applications in a variety of interesting problems in data mining using gene sequences, music, text corpora, and other inputs.
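To make the recipe concrete, here is a minimal Python sketch of the pairwise comparison just described, using the NCD formula that Chapter 3 develops: NCD(x,y) = (C(xy) − min(C(x),C(y))) / max(C(x),C(y)). The choice of zlib and all names below are illustrative only; any real-world compressor can play the role of C, and CompLearn itself offers several.

    import os
    import zlib

    def C(data: bytes) -> int:
        # Compressed size, in bytes, at zlib's maximum compression level;
        # this stands in for any real-world compressor (gzip, bzip2, PPM, ...).
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), cf. Chapter 3.
        # Similar inputs compress much better together, pushing NCD toward 0;
        # unrelated inputs gain nothing from concatenation, pushing it toward 1.
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 200
    b = b"the quick brown fox leaps over the lazy dog " * 200
    c = os.urandom(9000)          # incompressible random junk
    print(ncd(a, b))              # small: the two texts share nearly everything
    print(ncd(a, c))              # near 1: nothing in common

Filling in an entire matrix of such pairwise distances is then just a double loop over the set of objects.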
1.1.2 Visualization
Custom open source software has been written to provide powerful new visualization capabilities. The CompLearn software system (Chapter 10) implements our theory and with it experiments of two types may be carried out: classification or clustering. Classification refers to the application of discrete labels to a set of objects based on a set of examples from a human expert. Clustering refers to arrangement of objects into groups without prior training or influence by a human expert.

In this thesis we deal primarily with hierarchical or nested clustering in which a group of objects is arranged into a sort of binary tree. This clustering method is called the quartet method and will be discussed in detail later. In a nutshell, the quartet method is a way to determine a best matching tree given some data that is to be understood in a hierarchical cluster. It is called the quartet method because it is based on the smallest unrooted binary tree, which happens to be two pairs of two nodes for a total of four nodes comprising the quartet. It adds up many such small binary trees together to evaluate a big tree and then adjusts the tree according to the results of the evaluation. After a time, a best fitting tree is declared and the interpretation of the experimental results is possible. Our compression-based algorithms output a matrix of pairwise distances between objects. Because such a matrix is hard to interpret, we try to extract some of its essential features using the quartet method. This results in a tree optimized so that similar objects with small distances are placed near each other. The trees given in Figures 1.1, 1.2, and 1.3 (discussed below) have all been constructed using the quartet method.

The quartet tree search is non-deterministic. There are compelling theoretical reasons to suppose that the general quartet tree search problem is intractable to solve exactly for every case. But the method used here tries instead to approximate a solution in a reasonable amount of time, sacrificing accuracy for speed. It also makes extensive use of random numbers, and so there is sometimes variation in the results that the tree search produces. We describe the quartet tree method in detail in Chapter 4. In Chapter 6 we show numerous trees based on applying the quartet method and the NCD to a broad spectrum of input files in a wide array of domains.
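The bookkeeping behind this evaluation is easy to sketch. In the Python fragment below, the cost convention (the cost of topology uv|wx is d(u,v) + d(w,x)) follows Chapter 4; the function names and the topology_of hook are illustrative stand-ins for CompLearn’s actual tree machinery.

    from itertools import combinations

    def quartet_costs(d, u, v, w, x):
        # Costs of the three possible topologies for the quartet {u, v, w, x},
        # where the cost of uv|wx is d(u,v) + d(w,x) (the Chapter 4 convention).
        return {
            "uv|wx": d[u][v] + d[w][x],
            "uw|vx": d[u][w] + d[v][x],
            "ux|vw": d[u][x] + d[v][w],
        }

    def total_tree_cost(d, topology_of):
        # Adds up, over all n-choose-4 quartets, the cost of the topology that
        # a candidate tree induces; topology_of(u, v, w, x) is a stand-in for
        # the routine that reads the induced topology off the candidate tree
        # and returns one of the three keys above.
        n = len(d)
        return sum(quartet_costs(d, *q)[topology_of(*q)]
                   for q in combinations(range(n), 4))

The randomized search then perturbs the candidate tree and keeps perturbations that improve this total cost, as Chapter 4 explains in detail.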
1.1.3 Learning From the Web

It is possible to use coding theory to connect the compression approach to the web with the help of a search engine index database. By using a simple formula based on logarithms we can find “compressed sizes” of search terms. This was used in the Chinese tree in Figure 1.2. The tree of Nobel prize winning authors in Figure 1.3 was also made this way. As in the last example, a distance matrix is made, but this time with Google providing page count statistics that are converted to codelengths for use in the distance matrix calculations. We can see English and American writers clearly separated in the tree, as well as many other defensible placements. Another example, using prime numbers with Google, is in Chapter 7 (Section 7.5.2).

Throughout this thesis the reader will find ample experiments demonstrating the machine learning technology. There are objective experiments based on pure statistics using true data compressors and subjective experiments using statistics from web pages as well. There are examples taken from genetics, linguistics, literature, radio astronomy, optical character recognition, music, and many more diverse areas. Most of the experiments can be found in Chapters 4, 6, and 7.
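A minimal sketch of the codelength arithmetic mentioned above: the formula below is the Normalized Google Distance developed in Chapter 7, where f(·) denotes page counts returned by the search engine and M is (an estimate of) the total number of indexed pages. The numerical counts here are hypothetical, for illustration only.

    import math

    def ngd(fx, fy, fxy, M):
        # Normalized Google Distance from page counts: fx and fy are the hit
        # counts for each search term alone, fxy for the two terms together.
        # The base of the logarithm cancels between numerator and denominator.
        lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
        return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

    # Hypothetical counts: two terms that co-occur often are judged close.
    print(ngd(fx=9_000_000, fy=8_500_000, fxy=4_000_000, M=8 * 10**9))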
1.1.4 Clustering and Classification

The examples given above all dealt with clustering. It is also interesting to consider how we can use NCD to solve classification problems. Classification is the task of assigning labels to unknown test objects given a set of labeled training objects from a human expert. The goal is to try to learn the underlying patterns that the human expert is displaying in the choice of labellings shown in the training objects, and then to apply this understanding to the task of making predictions for unknown objects that are in some sense consistent with the given examples. Usually the problem is reduced to a combination of binary classification problems, where all target labels along a given dimension are either 0 or 1. In Chapter 5 we discuss this problem in greater detail, give some information about a popular classification engine called the Support Vector Machine (SVM), and connect the SVM to the NCD to create robust binary classifiers.
1.2 Gestalt Historical Context
Each of the three key ideas (compression as learning, quartet tree visualization, and learning from the web) has a common thread: all of them serve to increase the generality and practical robustness of the machine intelligence compared to more traditional alternatives. This goal is not new and has already been widely recognized as fundamental. In this section a brief and subjective overview of the recent history of artificial intelligence is given to provide a broader context for this thesis.

In the beginning, there was the idea of artificial intelligence. As circuit miniaturization took off in the 1970s, people’s imaginations soared with ideas of a new sort of machine with virtually unlimited potential: a (usually humanoid) metal automaton with the capacity to perform intelligent work and yet ask not one question out of the ordinary. A sort of ultra-servant, made able to reason as well as man in most respects, yet somehow reasoning in a sort of rarefied form whereby the more unpredictable sides of human nature are factored out.

[Figure 1.3: 102 Nobel prize winning writers using CompLearn and NGD. See Chapter 9 for details.]

One of the first big hurdles came as people tried to define just what intelligence was, or how one might codify knowledge in the most general sense into digital form. As Levesque and Brachman famously observed [75], reasoning and representation are hopelessly intertwined, and just what intelligence is depends very much on just who is doing the talking.

Immediately upon settling on the question of intelligence one almost automatically must grapple with the concept of language. Consciousness and intelligence are experienced only internally, yet the objects to which they apply are most often external to the self. Thus there is at once the question of communication and experience, and this straight-away ends any hope of perfect answers. Most theories on language are not theories in the formal sense [18]. A notable early exception is Quine’s famous observation that language translation is necessarily a dicey subject: for although you might collect very many pieces of evidence suggesting that a word means “X” or “Y”, you can never collect a piece of evidence that ultimately confirms that your understanding of the word is “correct” in any absolute sense. In a logical sense, we can never be sure that the meaning of a word is as it was meant, for to explain any word we must use other words, and these words themselves have only other words to describe them, in an interminable web of ontological disarray. Kantian empiricism leads us to pragmatically admit we have only the basis of our own internal experience to ground our understanding at the most basic level, and the mysterious results of the reasoning mind, whatever that might be. It is without a doubt the case that humans throughout the world develop verbal and usually written language quite naturally. Recent theories by Smale [41] have provided some theoretical support for empirical models of language evolution despite the formal impossibility of absolute certainty. Just the same, it leaves us with a very difficult question: how do we make bits think?

Some twenty years later, progress has been bursty. We have managed to create some amazingly elegant search and optimization techniques including simplex optimization, tree search, curve-fitting, and modern variants such as neural networks or support vector machines. We have built computers that can beat any human in chess, but we cannot yet find a computer smart enough to walk to the grocery store to buy a loaf of bread. There is clearly a problem of overspecialization in the types of successes we have so far enjoyed in artificial intelligence. This thesis explores my experience in charting this new landscape of concepts via a combination of pragmatic and principled techniques. It is only with the recent explosion in internet use and internet writing that we can now begin to seriously tackle these problems so fundamental to the original dream of artificial intelligence.

In recent years, we have begun to make headway in defining and implementing universal prediction, arguably the most important part of artificial intelligence. Most notable is Solomonoff prediction [105], and the more practical analogs by Ryabko and Astola [99] using data compression. In classical statistical settings, we typically make some observations of a natural (or at the very least, measurable) phenomenon.
Next, we use our intuition to “guess” which mathematical model might best apply. This process works well for those cases where the guesser has a good model for the phenomenon under consideration. This allows for at least two distinct modes of freedom: both in the choice of models, and also in the choice of criteria supporting “goodness”.
In the past the uneasy compromise has been to focus attention firstly on those problems which are most amenable to exact solution, to advance the foundation of exact and fundamental science. The next stage of growth was the advent of machine-assisted exact sciences, such as the now-famous four-color proof that required input (by hand!) of 1476 different graphs for computer verification (by a complicated program) that all were colorable before deductive extension to the most general case in the plane [6]. After that came the beginning of modern machine learning, based on earlier ideas of curve fitting and least-squares regression. Neural networks, and later support vector machines, gave us convenient learning frameworks in the context of continuous functions. Given enough training examples, the theory assured us, the neural network would eventually find the right combination of weightings and multiplicative factors that would miraculously, and perhaps a bit circularly, reflect the underlying meaning that the examples were meant to teach. Just like spectral analysis that came before, each of these areas yielded a whole new broad class of solutions, but was essentially hit or miss in its effectiveness in each domain for reasons that remain poorly understood. The focus of my research has been on the use of data compression programs for generalized inference. It turns out that this modus operandi is surprisingly general in its useful application and often yields the most expedient results as compared to other more predetermined methods. It is often “one size fits all well enough” and this yields unexpected fruits. From the outset, it must be understood that the approach here is decidedly different from more classical ones, in that we avoid in most ways an exact statement of the problem at hand, instead deferring this until very near the end of the discussion, so that we might better appreciate what can be understood about all problems with a minimum of assumptions. At this point a quote from Goldstein and Gigerenzer [46] is appropriate:

What are heuristics? The Gestalt psychologists Karl Duncker and Wolfgang Koehler preserved the original Greek definition of “serving to find out or discover” when they used the term to describe strategies such as “looking around” and “inspecting the problem” (e.g., Duncker, 1935/1945). For Duncker, Koehler, and a handful of later thinkers, including Herbert Simon (e.g., 1955), heuristics are strategies that guide information search and modify problem representations to facilitate solutions. From its introduction into English in the early 1800s up until about 1970, the term heuristics has been used to refer to useful and indispensable cognitive processes for solving problems that cannot be handled by logic and probability theory (e.g., Polya, 1954; Groner, Groner, & Bischof, 1983). In the past 30 years, however, the definition of heuristics has changed almost to the point of inversion. In research on reasoning, judgment, and decision making, heuristics have come to denote strategies that prevent one from finding out or discovering correct answers to problems that are assumed to be in the domain of probability theory. In this view, heuristics are poor substitutes for computations that are too demanding for ordinary minds to carry out. Heuristics have even become associated with inevitable cognitive illusions and irrationality.
This author sides with Goldstein and Gigerenzer in the view that sometimes “less is more”; the very fact that things are unknown to the naive observer can sometimes work to his advantage.
The recognition heuristic is an important, reliable, and conservative general strategy for inductive inference. In a similar vein, the NCD-based techniques shown in this thesis provide a general framework for inductive inference that is robust against a wide variety of circumstances.
1.3 Contents of this Thesis
In this chapter a summary is provided for the remainder of the thesis as well as some historical context. In Chapter 2, an introduction to the technical details and terminology surrounding the methods is given. In Chapter 3 we introduce the Normalized Compression Distance (NCD), the core mathematical formula that makes all of these experiments possible, and we establish connections between NCD and other well-known mathematical formulas. In Chapter 4 a tree search system is explained based on groups of four objects at a time, the so-called quartet method. In Chapter 5 we combine NCD with other machine learning techniques such as Support Vector Machines. In Chapter 6, we provide a wealth of examples of this technology in action. All experiments in this thesis were done using the CompLearn Toolkit, an open-source general-purpose data mining toolkit available for download from the http://complearn.org/ website. In Chapter 7, we show how to connect the internet to NCD using the Google search engine, thus providing the advanced sort of subjective analysis as shown in Figure 1.2. In Chapter 8 we use these techniques and others to trace the evolution of the legend of Saint Henry. In Chapter 9 we compare CompLearn against another, older tree search software system called PHYLIP. Chapter 10 gives a snapshot of the on-line documentation for the CompLearn system. After this, a Dutch language summary is provided as well as a bibliography, index, and list of papers by R. Cilibrasi.
Chapter 2 Technical Introduction
The spectacle is the existing order’s uninterrupted discourse about itself, its laudatory monologue. It is the self-portrait of power in the epoch of its totalitarian management of the conditions of existence. The fetishistic, purely objective appearance of spectacular relations conceals the fact that they are relations among men and classes: a second nature with its fatal laws seems to dominate our environment. But the spectacle is not the necessary product of technical development seen as a natural development. The society of the spectacle is on the contrary the form which chooses its own technical content. –Guy Debord, Society of the Spectacle
This chapter will give an informal introduction to relevant background material, familiarizing the reader with notation and basic concepts but omitting proofs. We discuss strings, languages, codes, Turing Machines and Kolmogorov complexity. This material will be extensively used in the chapters to come. For a more thorough and detailed treatment of all the material, including a tremendous number of innovative proofs, see [80]. It is assumed that the reader has a basic familiarity with algebra and probability theory as well as some rudimentary knowledge of classical information theory. We first introduce the notions of finite, infinite and string of characters. We go on to discuss basic coding theory. Next we introduce the idea of Turing Machines. Finally, in the last part of the chapter, we introduce Kolmogorov Complexity.
2.1 Finite and Infinite
In the domain of mathematical objects discussed in this thesis, there are two broad categories: finite and infinite. Finite objects are those whose extent is bounded. Infinite objects are those that are “larger” than any given precise bound. For example, if we perform 100 flips of a fair coin in sequence and retain the results in order, the full record will be easily written upon a single sheet of A4 size paper, or even a business card. Thus, the sequence is finite. But if we instead talk about the list of all prime numbers greater than 5, then the sequence written literally is infinite in extent. There are far too many to write on any given size of paper no matter how big. It is possible, however, to write a computer program that could, in principle, generate every prime
number, no matter how large, eventually, given unlimited time and memory. It is important to realize that some objects are infinite in their totality, but can be finite in a potential effective sense by the fact that every finite but a priori unbounded part of them can be obtained from a finite computer program. There will be more to say on these matters later in Section 2.5.
2.2 Strings and Languages
A bit, or binary digit, is just a single piece of information representing a choice between one of two alternatives, either 0 or 1.

A character is a symbol representing an atomic unit of written language that cannot be meaningfully subdivided into smaller parts. An alphabet is a set of symbols used in writing a given language. A language (in the formal sense) is a set of permissible strings made from a given alphabet. A string is an ordered list (normally written sequentially) of 0 or more symbols drawn from a common alphabet. For a given alphabet, different languages deem different strings permissible. In English, 26 letters are used, but also the space and some punctuation should be included for convenience, thus increasing the size of the alphabet. In computer files, the underlying base is 256 because there are 256 different states possible in each indivisible atomic unit of storage space, the byte. A byte is equivalent to 8 bits, so the 256-symbol alphabet is central to real computers. For theoretical purposes, however, we can dispense with the complexities of large alphabets by realizing that we can encode large alphabets into small ones; indeed, this is how a byte can be encoded as 8 bits. A bit is a symbol from a 2-symbol, or binary, alphabet. In this thesis, there is not usually any need for an alphabet of more than two characters, so the notational convention is to restrict attention to the binary alphabet in the absence of countervailing remarks. Usually we encode numbers as a sequence of characters in a fixed radix format at the most basic level, and the space required to encode a number in this format can be calculated with the help of the logarithm function. The logarithm function is always used to determine a coding length for a given number to be encoded, given a probability or integer range. Similarly, it is safe for the reader to assume that all log’s are taken base 2 so that we may interpret the results in bits.

We write Σ to represent the alphabet used. We usually work with the binary alphabet, so in that case Σ = {0,1}. We write Σ∗ to represent the space of all possible strings including the empty string. This notation may be a bit unfamiliar at first, but is very convenient and is related to the well-known concept of regular expressions. Regular expressions are a concise way of representing formal languages as sets of strings over an alphabet. The curly braces represent a set (to be used as the alphabet in this case) and the ∗ symbol refers to the closure of the set. By closure we mean that symbols from the set may be repeated 0, 1, 2, 3, or any number of times. By definition, {0,1}∗ = ⋃_{n≥0} {0,1}ⁿ. It is important to realize that successive symbols need not be the same, but could be. Here we can see that the number of possible binary strings is infinite, yet any individual string in this class must itself be finite. For a string x, we write |x| to represent the length, measured in symbols, of that string.
2.3 The Many Facets of Strings
Earlier we said that a string is a sequence of symbols from an alphabet. It is assumed that the symbols in Σ have a natural or at least conventional ordering. From this we may inductively create a rule that allows us to impose an ordering on all strings that are possible in Σ∗ in the conventional way: use length first to bring the shorter strings as early as possible in the ordering, and then use the leftmost different character in any two strings to determine their relative ordering. This is just a generalized restatement of the familiar alphabetical or lexicographic ordering. It is included here because it allows us to associate a positive integer ordering number with each possible string. The empty string, ε, is the first string in this list. The next is the string 0, and the next 1. After that comes 00, then 01, then 10, then 11, then 000, and so on ad nauseam. Next to each of these strings we might list their lengths as well as their ordering-number position in this list as follows:
x      |x|   ORD(x)
ε       0       1
0       1       2
1       1       3
00      2       4
01      2       5
10      2       6
11      2       7
000     3       8
001     3       9
010     3      10
011     3      11
100     3      12
... and so on forever ...

Here there are a few things to notice. First is that the second column, the length of x written |x|, is related to ORD(x) by the following relationship:
log(ORD(x)) − 1 < |x| ≤ log(ORD(x)). (2.3.1)
Thus we can see that the variable x can be interpreted polymorphically: as either a literal string of characters having a particular sequence and length or instead as an integer in one of two ways: either by referring to its length using the | · | symbol, or by referring to its ordinal number using ORD(x). All of the mathematical functions used in this thesis are monomorphic in their argument types: each argument can be either a number (typically an integer) or a string, but not both. Thus without too much ambiguity we will sometimes leave out the ORD symbol and just write x and rely on the reader to pick out the types by their context and usage. Please notice that x can either stand for the string x or the number ORD(x), but never for the length of x, which we always explicitly denote as |x|.
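This correspondence is mechanical enough to state as code; here is a small Python sketch (the function names are ours, for illustration):

    def ord_of(x: str) -> int:
        # ORD(x) = 2**|x| + (x read as a binary number), with ε contributing 0:
        # the strings of length n occupy ordinals 2**n through 2**(n+1) - 1.
        return 2 ** len(x) + (int(x, 2) if x else 0)

    def string_of(n: int) -> str:
        # Inverse mapping: drop the leading 1 from the binary form of n.
        return bin(n)[3:]

    assert [string_of(n) for n in range(1, 8)] == ["", "0", "1", "00", "01", "10", "11"]
    # |x| is the integer part of log(ORD(x)), in line with Eq. (2.3.1):
    assert all(len(x) == ord_of(x).bit_length() - 1 for x in ["", "0", "11", "010"])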
2.4 Prefix Codes
A binary string y is a proper prefix of a binary string x if we can write x = yz for z ≠ ε. A set {x,y,...} ⊆ {0,1}∗ is prefix-free if no element is a proper prefix of any other. A prefix-free set can be used to define a prefix code. Formally, a prefix code is defined by a decoding function D, which is a function from a prefix-free set to some arbitrary set X. The elements of the prefix-free set are called code words. The elements of X are called source words. If the inverse D⁻¹ of D exists, we call it the encoding function. An example of a prefix code that is used later encodes a source word x = x₁x₂…xₙ by the code word
x̄ = 1ⁿ0x.
Here X = {0,1}∗, D⁻¹(x) = x̄ = 1ⁿ0x. This prefix-free code is called self-delimiting, because there is a fixed computer program associated with this code that can determine where the code word x̄ ends by reading it from left to right without backing up. This way a composite code message can be parsed in its constituent code words in one pass by a computer program.¹ In other words, a prefix code is a code in which no code word is a prefix of another code word. Prefix codes are very easy to decode because code word boundaries are directly encoded along with each datum that is encoded. To introduce these, let us consider how we may combine any two strings together in a way that they could be later separated without recourse to guessing. In the case of arbitrary binary strings x,y, we cannot be assured of this prefix condition: x might be 0 while y was 00 and then there would be no way to tell the original contents of x or y given, say, just xy. Therefore let us concentrate on just the x alone and think about how we might augment the natural literal encoding to allow for prefix disambiguation. In real languages on computers, we are blessed with whitespace and commas, both of which are used liberally for the purpose of separating one number from the next in normal output formats. In a binary alphabet our options are somewhat more limited but still not too bad. The simplest solution would be to add in commas and spaces to the alphabet, thus increasing the alphabet size to 4 and the coding size to 2 bits, doubling the length of all encoded strings. This is a needlessly heavy price to pay for the privilege of prefix encoding, as we will soon see. But first let us reconsider another way to do it in a bit more than double space: suppose we preface x with a sequence of |x| 0’s, followed by a 1, followed by the literal string x. This then takes one bit more than twice the space for x and is even worse than the original scheme with commas and spaces added to the alphabet. This is just the scheme discussed in the beginning of the section. But this scheme has ample room for improvement: suppose now we adjust it so that instead of outputting all those 0’s at first in unary, we instead just output a number of zeros equal to
⌈log(|x|)⌉, then a 1, then the binary number |x| (which satisfies |x| ≤ ⌈log x⌉ + 1, see Eq. (2.3.1)), then x literally. Here, ⌈·⌉ indicates the ceiling operation that returns the smallest integer not less than
¹This desirable property holds for every prefix-free encoding of a finite set of source words, but not for every prefix-free encoding of an infinite set of source words. For a single finite computer program to be able to parse a code message the encoding needs to have a certain uniformity property like the x̄ code.
its argument. This, then, would take a number of bits about

2⌈log log x⌉ + ⌈log x⌉ + 1,

which exceeds ⌈log x⌉, the number of bits needed to encode x literally, only by a logarithmic amount. If this is still too many bits then the pattern can be repeated, encoding the first set of 0’s one level higher using the system to get

2⌈log log log x⌉ + ⌈log log x⌉ + ⌈log x⌉ + 1.

Indeed, we can “dial up” as many logarithms as are necessary to create a suitably slowly-growing composition of however many log’s are deemed appropriate. This is sufficiently efficient for all purposes in this thesis and provides a general framework for converting arbitrary data into prefix-free data. It further allows us to compose any number of strings or numbers for any purpose without restraint, and allows us to make precise the difficult concept of K(x,y), as we shall see in Section 2.6.4.
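One concrete reading of this construction, as a runnable Python sketch (the exact bit layout is our illustrative choice): a run of 0’s announces how many bits the binary length field occupies, then comes a 1, then |x| in binary, then x itself, literally.

    def encode(x: str) -> str:
        # Self-delimiting code: len(length_field) zeros, a 1, |x| in binary, x.
        length_field = bin(len(x))[2:]
        return "0" * len(length_field) + "1" + length_field + x

    def decode(stream: str) -> tuple:
        # Returns (x, rest of stream); reads left to right, never backing up.
        k = 0
        while stream[k] == "0":       # count the announcing zeros
            k += 1
        length = int(stream[k + 1:k + 1 + k], 2)
        start = k + 1 + k
        return stream[start:start + length], stream[start + length:]

    # Two strings concatenated remain unambiguously separable:
    packed = encode("1011") + encode("00")
    x, rest = decode(packed)
    y, _ = decode(rest)
    assert (x, y) == ("1011", "00")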
2.4.1 Prefix Codes and the Kraft Inequality

Let X be the set of natural numbers and consider the straightforward non-prefix binary representation with the ith binary string in the length-increasing lexicographical order corresponding to the number i. There are two elements of X with a description of length 1, four with a description of length 2 and so on. However, there are fewer binary prefix code words of each length: if x is a prefix code word then no y = xz with z ≠ ε is a prefix code word. Asymptotically there are fewer prefix code words of length n than the 2ⁿ source words of length n. Clearly this observation holds for arbitrary prefix codes. Quantification of this intuition for countable X and arbitrary prefix-codes leads to a precise constraint on the number of code words of given lengths. This important relation is known as the Kraft Inequality and is due to L.G. Kraft [62].
2.4.1. LEMMA. Let l₁, l₂, … be a finite or infinite sequence of natural numbers. There is a prefix code with this sequence as lengths of its binary code words iff

∑ₙ 2^(−lₙ) ≤ 1. (2.4.1)
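The inequality is easy to check numerically; a minimal sketch:

    def kraft_sum(lengths):
        # Left-hand side of the Kraft Inequality (2.4.1).
        return sum(2 ** -l for l in lengths)

    # {0, 10, 110, 111} is a prefix code; its lengths saturate the bound:
    assert kraft_sum([1, 2, 3, 3]) == 1.0
    # No prefix code has lengths [1, 1, 2]: 1/2 + 1/2 + 1/4 > 1.
    assert kraft_sum([1, 1, 2]) > 1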
2.4.2 Uniquely Decodable Codes

We want to code elements of some set X in a way that they can be uniquely reconstructed from the encoding. Such codes are called uniquely decodable. Every prefix-code is a uniquely decodable code. On the other hand, not every uniquely decodable code satisfies the prefix condition. Prefix-codes are distinguished from other uniquely decodable codes by the property that the end of a code word is always recognizable as such. This means that decoding can be accomplished without the delay of observing subsequent code words, which is why prefix-codes are also called instantaneous codes. There is good reason for our emphasis on prefix-codes. Namely, it turns out that Lemma 2.4.1 stays valid if we replace “prefix-code” by “uniquely decodable code.”
This important fact means that every uniquely decodable code can be replaced by a prefix-code without changing the set of code-word lengths. In this thesis, the only aspect of actual encodings that interests us is their length, because this reflects the underlying probabilities in an associated model. There is no loss of generality in restricting further discussion to prefix codes because of this property.
2.4.3 Probability Distributions and Complete Prefix Codes

A uniquely decodable code is complete if the addition of any new code word to its code word set results in a non-uniquely decodable code. It is easy to see that a code is complete iff equality holds in the associated Kraft Inequality. Let l₁, l₂, … be the code word lengths of some complete uniquely decodable code. Let us define q_x = 2^(−l_x). By definition of completeness, we have ∑_x q_x = 1. Thus, the q_x can be thought of as probability mass functions corresponding to some probability distribution Q for a random variable X. We say Q is the distribution corresponding to l₁, l₂, …. In this way, each complete uniquely decodable code is mapped to a unique probability distribution. Of course, this is nothing more than a formal correspondence: we may choose to encode outcomes of X using a code corresponding to a distribution q, whereas the outcomes are actually distributed according to some p ≠ q. But, as we argue below, if X is distributed according to p, then the code to which p corresponds is, in an average sense, the code that achieves optimal compression of X. In particular, every probability mass function p is related to a prefix code, the Shannon-Fano code, such that the expected number of bits per transmitted code word is as low as is possible for any prefix code, assuming that a random source X generates the source words x according to P(X = x) = p(x). The Shannon-Fano prefix code encodes a source word x by a code word of length l_x = ⌈log 1/p(x)⌉, so that the expected transmitted code word length equals ∑_x p(x) log 1/p(x) = H(X), the entropy of the source X, up to one bit. This is optimal by Shannon’s “noiseless coding” theorem [102]. This is further explained in Section 2.7.
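A minimal sketch of the Shannon-Fano code lengths (names ours; we compute only the lengths, since by Lemma 2.4.1 they satisfy the Kraft Inequality and so a prefix code realizing them exists):

    import math

    def shannon_fano_lengths(p):
        # l_x = ceil(log2(1/p(x))) for a probability mass function p.
        return {x: math.ceil(math.log2(1 / px)) for x, px in p.items()}

    p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    lengths = shannon_fano_lengths(p)          # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
    expected = sum(px * lengths[x] for x, px in p.items())
    entropy = -sum(px * math.log2(px) for px in p.values())
    # Expected code length is within one bit of the entropy H(X):
    assert entropy - 1e-9 <= expected <= entropy + 1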
2.5 Turing Machines
This section mainly serves as a preparation for the next section, in which we introduce the fundamental concept of Kolmogorov complexity. Roughly speaking, the Kolmogorov complexity of a string is the length of the shortest computer program that computes the string, i.e. that prints it, and then halts. The definition depends on the specific computer programming language that is used. To make the definition more precise, we should base it on programs written for universal Turing machines, which are an abstract mathematical representation of a general-purpose computer equipped with a general-purpose or universal computer programming language.
Universal Computer Programming Languages: Most popular computer programming languages, such as C, Lisp, Java and Ruby, are universal. Roughly speaking, this means that they are powerful enough to emulate any other computer programming language: every universal computer programming language can be used to write a compiler for any other programming language, including any other universal programming language. Indeed, this has been done already
a thousand times over with the GNU (Gnu's Not Unix) C compiler, perhaps the most successful open-source computer program in the world. In this case, although there are many different assembly languages in use on different CPU architectures, all of them are able to run C programs. So we can always package any C program along with the GNU C compiler, which itself is not more than 100 megabytes, in order to run the program anywhere.
Turing Machines: The Turing machine is an abstract mathematical representation of the idea of a computer. It generalizes and simplifies all the many specific types of deterministic computing machines into one regularized form. A Turing machine is defined by a set of rules which describe its behavior. It receives as its input a string of symbols, which may be thought of as a "program", and it outputs the result of running that program, which amounts to transforming the input using the given set of rules. Just as there are universal computer languages, there are also universal Turing machines. We say a Turing Machine is universal if it can simulate any other Turing Machine. When such a universal Turing machine receives as input a pair ⟨x,y⟩, where x is a formal specification of another Turing machine T_x, it outputs the same result as one would get if one would input the string y to the Turing machine T_x. Just as any universal programming language can be used to emulate any other one, any universal Turing machine can be used to emulate any other one. It may help intuition to imagine any familiar universal computer programming language as a definition of a universal Turing machine, and the runtime and hardware needed to execute it as a sort of real-world Turing machine itself. It is necessary to remove resource constraints (on memory size and input/output interface, for example) in order for these concepts to be thoroughly equivalent theoretically.
Turing machines, formally: A Turing machine consists of two parts: a finite control and a tape. The finite control is the memory (or current state) of the machine. It always contains a single state from a finite set Q of possible states. The tape initially contains the program which the Turing machine must execute. The tape contains symbols from the trinary alphabet A = {0,1,B}. Initially, the entire tape contains the B (blank) symbol except for the place where the program is stored. The program is a finite sequence of bits. The finite control is also always positioned above a particular symbol on the tape and may move left or right one step. At first, the tape head is positioned at the first nonblank symbol on the tape. As part of the formal definition of a Turing machine, we must indicate which state from Q is to be the starting state of the machine. At every time step the Turing machine does a simple sort of calculation by consulting a list of rules that define its behavior. The rules may be understood to be a function taking two arguments (the current state and the symbol under the reading head) and returning a Cartesian pair: the action to execute this timestep and the next state to enter. That is to say, the two input arguments are the current state (an element of Q) of the finite control and a letter from the alphabet A. The two outputs are a new state (also taken from Q) and an action symbol taken from S. The set of possible actions is S = {0,1,B,L,R}. The 0, 1, and B symbols refer to writing that value below the tape head. The L and R symbols refer to moving left or right, respectively. This function defines the behavior of the Turing machine at each step, allowing it to perform simple actions and run a program on a tape just like a real computer but in a very mathematically simple
17 way. It turns out that we can choose a particular set of state-transition rules such that the Turing machine becomes universal in the sense described above. This simulation is plausible given a moment of reflection on how a Turing Machine is mechanically defined as a sequence of rules governing state transitions etc. The endpoint in this line of reasoning is that a universal Turing Machine can run a sort of Turing Machine simulation system and thereby compute identical results as any other Turing Machine.
Notation: We typically use the Greek letter Φ to represent a Turing machine T as a partially defined function. When the Turing machine T is not clear from the context, we write Φ_T. The function is supposed to take as input a program encoded as a finite binary string and to output the result of running that program. Sometimes it is convenient to define the function as taking integers instead of strings; this is easy enough to do when we remember that each integer is identified with a given finite binary string given the natural lexicographic ordering of finite strings, as in Section 2.3. The function Φ need only be partially defined; for some input strings it is not defined because some programs do not produce a finite string as output, such as infinite looping programs. We say that Φ is defined only for those programs that halt and therefore produce a definite output. We introduce a special symbol ∞ that represents an abstract object outside the space of finite binary strings and unequal to any of them. For those programs that do not halt we say Φ(x) = ∞ as a shorthand way of indicating this infinite loop: x is thus a non-halting program like the following:
x = while true ; do ; done
Here we can look a little deeper into the x program above and see that although its runtime is infinite, its definition is quite finite; it is less than 30 characters. Since this program is written in the ASCII codespace, we can multiply this figure by 8 to reach a size of 240 bits.
Prefix Turing Machines: In this thesis we look at Turing Machines whose set of halting programs is prefix-free: that is to say, the set of such programs forms a prefix code (Section 2.4), because no halting program is a prefix of another halting program. We can realize this by slightly changing the definition of a Turing machine, equipping it with a one-way input or 'data' tape, a separate working tape, and a one-way output tape. Such a Turing Machine is called a prefix machine. Just as there are universal "ordinary" Turing Machines, there are also universal prefix machines that have identical computational power.
2.6 Kolmogorov Complexity
Now is where things begin to become tricky. There is a very special function K called Kolmogorov Complexity. Intuitively, the Kolmogorov complexity of a finite string x is the length of the shortest computer program that prints x and then halts. More precisely, K is usually defined as a unary function that maps strings to integers and is implicitly based (or conditioned) on a concrete reference Turing machine represented by function Φ. The complete way of writing it is K_Φ(x).
In practice, we want to use a Turing Machine that is as general as possible. It is convenient to require the prefix property. Therefore we take Φ to be a universal prefix Turing Machine.² Because all universal Turing Machines can emulate one another reasonably efficiently, it does not matter much which one we take. We will say more about this later. For our purposes, we can suppose a universal prefix Turing machine is equivalent to any formal (implemented, real) computer programming language extended with a potentially unlimited memory. Recall that Φ represents a particular Turing machine with particular rules, and remember Φ is a partial function that is defined for all programs that terminate. If Φ is the transformation that maps a program x to its output o, then K_Φ(z) represents the minimum program size (in bits) |x| over all valid programs x such that Φ(x) = z. We can think of K as representing the smallest quantity of information required to recreate an object by any reliable procedure. For example, let x be the first 1000000 digits of π. Then K(x) is small, because there is a short program generating x, as explained further below. On the other hand, for a random sequence of digits, K(x) will usually be large because the program will probably have to hardcode a long list of arbitrary values.
2.6.1 Conditional Kolmogorov Complexity

There is another form of K, a bit harder to understand but still important to our discussions, called conditional Kolmogorov Complexity and written
K(z|y).
The notation is confusing to some because the function takes two arguments. Its definition requires a slight enhancement of the earlier model of a Turing machine. While a Turing machine has a single infinite tape, conditional Kolmogorov complexity is defined with respect to prefix Turing machines, which have an infinite working tape, an output tape and a restricted input tape that supports only one operation called "read next symbol". This input tape is often referred to as a data tape and is very similar to an input data file or stream read from standard input in Unix. Thus instead of imagining a program as a single string we must imagine a total runtime environment consisting of two parts: an input program tape with read/write memory, and a data tape of extremely limited functionality, unable to seek backward, with the same limitations as POSIX standard input: there is getchar but no fseek. In the context of this slightly more complicated machine, we can define K(z|y) as the size, in bits, of the smallest program that outputs z given a prefix-free encoding of y, say ȳ, as an initial input on the data tape. The idea is that if y gives a lot of information about z then K(z|y) ≪ K(z), but if z and y are completely unrelated, then K(z|y) ≈ K(z). For example, if z = y, then y provides a maximal amount of information about z. If we know that z = y then a suitable program might be the following:
while true ; do
  c = getchar()
  if ( c == EOF ) ; then halt
  else putchar(c)
done

²There exists a version of Kolmogorov complexity that is based on standard rather than prefix Turing machines, but we shall not go into it here.
Here, already, we can see that K(x|x) < 1000 given the program above and a suitable universal prefix Turing machine: the number of bits used to encode the whole program is less than 1000. The more interesting case is when the two arguments are not equal, but related. Then the program must provide the missing information through more complicated translation, preprogrammed results, or some other device.
2.6.2 Kolmogorov Randomness and Compressibility

As it turns out, K provides a convenient means for characterizing random sequences. Contrary to popular belief, random sequences are not simply sequences with no discernible patterns. Rather, there are a great many statistical regularities that can be proven and observed, but the difficulty lies in simply expressing them. As mentioned earlier, we can very easily express the idea of randomness by first defining different degrees of randomness as follows: a string x is k-random if and only if K(x) > |x| − k. This simple formula expresses the idea that random strings are incompressible. The vast majority of strings are 1-random in this sense. This definition improves greatly on earlier definitions of randomness because it provides a concrete way to show a given, particular string is non-random by means of a simple computer program. At this point, an example is appropriate. Imagine the following sequence of digits: 1, 4, 1, 5, 9, 2, 6, 5, 3, ... and so on. Some readers may recognize this sequence as the first digits of the decimal expansion of π with the first digit (3) omitted. If we extend these digits forward to a million places and continue to follow the precise decimal approximation of π, we would have a sequence that might appear random to most people. But it would be a matter of some confusing debate to try to settle a bet upon whether or not the sequence were truly random, even with all million of the digits written out in several pages. However, a clever observer, having noticed the digits corresponded to π, could simply write a short computer program (perhaps gotten off the internet) of no more than 10 kilobytes that could calculate the digits and print them out. What a surprise it would be then, to see such a short program reproduce such a long and seemingly meaningless sequence perfectly. This reproduction using a much shorter (less than one percent of the literal size) program is itself direct evidence that the sequence is non-random and in fact implies a certain regularity in the data with a high degree of likelihood. Simple counting arguments show that there can be no more than a vanishingly small number of highly compressible strings; in particular, the proportion of strings that are compressible by even k bits is no more than 2^{−k}. This can be understood by remembering that there are just two 1-bit strings (0 and 1), four 2-bit strings, and 2^m m-bit strings. So if we consider encodings of length m for source strings of length n with n > m, then at most 2^m different strings out of the total of 2^n source strings can be encoded in m bits. Thus, the proportion of strings compressible by n−m bits is at most 2^{m−n} of all strings.
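The incompressibility of random strings is easy to witness with a real compressor standing in for K (a sketch of ours; zlib only upper-bounds K, as Section 2.8 discusses):

    import os, zlib

    random_data = os.urandom(10000)             # 10,000 random bytes
    regular_data = b'14159265358979' * 1000     # an obviously patterned string
    print(len(zlib.compress(random_data, 9)))   # about 10,000: no real savings
    print(len(zlib.compress(regular_data, 9)))  # a tiny fraction of the input size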
2.6.3 Universality In K

We have remarked earlier how universal Turing machines may emulate one another using finite simulation programs. In talking about the asymptotic behavior of K, these finite simulation programs do not matter more than an additive constant. For example, if we take x_n to mean the first n digits of π, then K(x_n) = O(log n) no matter which universal Turing machine is in use. This is because it is possible to calculate any number of digits of π using a fixed-size program that reads as input the number of digits to output. This input cannot be encoded in fewer than log n bits, by a counting argument as in the previous section. This implies that all variations of K are in some sense equivalent, because any two different variants of K given two different reference universal Turing machines will never differ by more than a fixed-size constant that depends only on the particular Turing machines chosen and not on the sequence. It is this universal character that winds up lending credence to the idea that K can be used as an absolute measure of the information contained in a given object. This is quite different from standard Shannon Information Theory, which is based on the idea of the average information required to communicate an object over a large number of trials, given some sort of generating source [103]. The great benefit of the Kolmogorov Complexity approach is that we need not explicitly define the generating source nor run the many trials to see desired results; just one look at the object is enough. Section 2.7 provides an example that will serve to illustrate the point.
2.6.4 Sophisticated Forms of K

There is now one more form of the K function that should be addressed, though it is perhaps the most complicated of all. It is written as follows: K(x,y). This represents the size in bits of the minimum program that outputs x followed by y, provided the output is given by first outputting x in a self-delimiting way (as explained earlier) and then outputting y. Formally, we define K(x,y) as K(⟨x,y⟩), where ⟨·,·⟩ is the pairing operation that takes two strings and returns the pair, encoded as the concatenation x̄y with x̄ a self-delimiting encoding of x.
2.7 Classical Probability Compared to K
Suppose we flip a fair coin. The type of sequence generated by the series of N flips of a fair coin is unpredictable in nature by assumption in normal probability theory. To define precisely what this means presents a bewildering array of possibilities. In the simplest, we might say the sequence is generated by a Bernoulli process where X takes on the value 0 or 1 with probability

P_fair(X = 0) = 1/2 = P_fair(X = 1).

The notation P(·) represents the chance that the event inside occurs. It is expressed as a ratio between 0 and 1, with 0 meaning never, 1 meaning always, and every number in between representing the proportion of times the event will be true given a large enough number of independent
trials. In such a setting, we may use a single bit to represent either possibility efficiently, and can always store N coin flips in just N bits regardless of the outcomes. What if, instead of a fair coin, we use a biased one? For instance, if

P_biased(X = 0) = 1/8,

and therefore, since our simplified coins always turn up 0 or 1,

P_biased(X = 1) = 7/8.

Then we may use the scheme above to reliably transmit N flips in N bits. Alternatively, we may decide to encode the 1's more efficiently by using the following simple rule. Assume that N is even. Divide the N flips into pairs, and encode the pairs so that a pair of 1's takes just a single 1 bit to encode. If both are not 1, then instead output a 0 and then two more bits to represent the actual outcomes in order. Then continue with the next pair of two. One can quickly calculate that "49/64 of the time" the efficient 1-bit codeword will be output in this scheme, which will save a great deal of space. Some of this savings will be lost in the cases where the 3-bit codeword is emitted, 15/64 of the time. The average number of bits needed per outcome transmitted is then the codelength c:

c = (1/2)(1 · 49/64 + 3 · 15/64) = 94/128.

This can also be improved somewhat, down to the Shannon entropy H(X) [80] of the source X, with longer blocks or smarter encoding such as arithmetic codes [93] over an alphabet Σ:

H(X) = ∑_{i∈Σ} −P(X = i) log P(X = i),

c = −(1/8) log(1/8) − (7/8) log(7/8).

By Shannon's famous coding theorem, this is essentially the smallest average code length that can be obtained under the assumption that the coin is independently tossed according to P_biased. Here though, there is already a problem, as we now cannot say, unconditionally at least, that this many bits will be needed for any actual sequence of bits; luck introduces some variation in the actual space needed, though it is usually near the average. We know that such a coin is highly unlikely to repeatedly emit 0's, yet we cannot actually rule out this possibility. More to the point, in abstract terms the probability, while exponentially decaying with the greatest haste, still never quite reaches zero. It is useful to think carefully about this problem. All the laws of classical probability theory cannot make claims about a particular sequence but instead only about ensembles of experiments and expected proportions. Trying to pin down uncertainty in this crude way often serves only to make it appear elsewhere instead. In the Kolmogorov Complexity approach, we turn things upside-down: we say that a string is random if it is incompressible. A string is c-random if K(x) > |x| − c. This directly addresses the question of how random a given string is by introducing different grades of randomness and invoking the universal function
K to automatically rule out the possibility of any short programs predicting a random string defined in this way. Returning to the fair coin example, the entropy is 1 bit per outcome. But we cannot say with certainty that a sequence coming from such a coin cannot be substantially compressed. This is only true with high probability.
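The arithmetic above is easy to check numerically (a sketch of ours; the numbers are those from the text):

    from math import log2

    p1 = 7 / 8                                    # P_biased(X = 1)
    p_pair = p1 * p1                              # both flips are 1: 49/64
    c = (1 * p_pair + 3 * (1 - p_pair)) / 2       # bits per outcome, pair scheme
    H = -(1 / 8) * log2(1 / 8) - p1 * log2(p1)    # Shannon entropy per outcome
    print(c)   # 0.734375 = 94/128
    print(H)   # about 0.5436: the limit achievable with smarter encodings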
2.8 Uncomputability of Kolmogorov Complexity
Some have made the claim that Kolmogorov Complexity is objective. In theory, it is. But in practice it is difficult to say; one major drawback of K is that it is uncomputable. A naive attempt to compute it leads one to try the shortest programs first, and as shown above it does not take many characters in a reasonable language to produce an infinite loop. This problem is impossible to protect against in general, and any multi-threaded approach is doomed to failure for this reason, as it bumps up against the Halting Problem [80]. A more fruitful approach has been to apply Kolmogorov Complexity by approximating it with data compressors. We may consider the problem of efficiently encoding a known biased random source into a minimum number of bits in such a way that the original sequence, no matter what it was, can once again be reconstructed, but so that also for certain sequences a shorter code is output. This is the basic idea of a data compression program. The most commonly used data compression programs of the last decade include gzip, bzip2, and PPM. gzip is an old and reliable Lempel-Ziv type compressor with a 32-kilobyte window [122]. It is the simplest and fastest of the three compressors. bzip2 is a wonderful new compressor using the blocksort algorithm [21]. It provides good compression and an expanded window of 900 kilobytes, allowing longer-range patterns to be detected. It is also reasonably fast. PPM stands for Prediction by Partial Matching [8]. It is part of a new generation of powerful compressors using a pleasing mix of statistical models arranged by trees, suffix trees or suffix arrays. It usually achieves the best performance of any real compressor, yet is also usually the slowest and most memory intensive. Although restricted to the research community, a new challenger to PPM has arisen called context mixing compression. It is often the best compression scheme for a variety of file types but is very slow; further, it currently uses a neural network to do the mixing of contexts. See the paq series of compressors on the internet for more information on this exciting development in compression technology. We use these data compressors to approximate from above the Kolmogorov Complexity function K. It is worth mentioning that all of the real compressors listed above operate on a bytewide basis, and thus all will return a multiple of 8 bits in their results. This is unfortunate for analyzing small strings, because the granularity is too coarse to allow for fine resolution of subtle details. To overcome this problem, the CompLearn system – the software with which almost all experiments in later chapters have been carried out – supports the idea of a virtual compressor (originally suggested by Steven de Rooij): a virtual compressor is one that does not actually output an encoded compressed form, but instead simply accumulates the number of bits necessary to encode the results using a hypothetical arithmetic (or entropy) encoder. This frees
us from the bytewide restriction and indeed eliminates the need for rounding to a whole number of bits. Instead we may just return a real or floating-point value. This becomes quite useful when analyzing very similar strings of less than 100 bytes.
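In this spirit, a compressed size can serve as a crude upper bound on K (a sketch of ours using stock compressors; this is not the CompLearn virtual compressor just described):

    import bz2, zlib

    def approx_K_bits(data):
        # Take the smaller of two real compressed sizes as an upper bound on K.
        return 8 * min(len(zlib.compress(data, 9)), len(bz2.compress(data, 9)))

    x = b'1415926535' * 500
    print(approx_K_bits(x), 'bits, versus', 8 * len(x), 'literal bits')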
2.9 Conclusion
We have introduced the notion of universal computation and the K function indicating Kolmogorov Complexity. We have introduced Turing Machines and prefix codes as well as prefix machines. We have discussed a definition of a random string using K. We use these concepts in the next few chapters to explain in great detail our theory and experimental results.
Chapter 3 Normalized Compression Distance (NCD)
You may very appropriately want to ask me how we are going to resolve the ever acceleratingly dangerous impasse of world-opposed politicians and ideological dogmas. I answer, it will be resolved by the computer. Man has ever-increasing confidence in the computer; witness his unconcerned landings as airtransport passengers coming in for a landing in the combined invisibility of fog and night. While no politician or political system can ever afford to yield understandably and enthusiastically to their adversaries and opposers, all politicians can and will yield enthusiastically to the computer's safe flight-controlling capabilities in bringing all of humanity in for a happy landing. –Buckminster Fuller in Operating Manual for Spaceship Earth
In this chapter the Normalized Compression Distance (NCD) and the related Normalized Information Distance (NID) are presented and investigated. NCD is a similarity measure based on a data compressor. NID is simply the instantiation of NCD using the theoretical (and uncomputable) Kolmogorov compressor. Below we first review the definition of a metric. In Section 3.3, we explain precisely what is meant by universality in the case of NID. We discuss compressor axioms in Section 3.2, and properties of NCD in Section 3.4. At the end of the chapter, we connect NCD with a classical statistical quantity called Kullback-Leibler divergence. In Section 3.6.1 we connect arithmetic compressors to entropy, and in Section 3.6.2 we relate them to KL-divergence.
3.1 Similarity Metric
In mathematics, different distances arise in all sorts of contexts, and one usually requires these to be a "metric". We give a precise formal meaning to the loose distance notion of "degree of similarity" used in the pattern recognition literature. Metric: Let Ω be a nonempty set and ℝ⁺ be the set of nonnegative real numbers. A metric on Ω is a function D : Ω × Ω → ℝ⁺ satisfying the metric (in)equalities:
• D(x,y) = 0 iff x = y,
25 • D(x,y) = D(y,x) (symmetry), and
• D(x,y) ≤ D(x,z) + D(z,y) (triangle inequality).
The value D(x,y) is called the distance between x,y ∈ Ω. A familiar example of a metric is the Euclidean metric, the everyday distance e(a,b) between two objects a,b expressed in, say, meters. Clearly, this distance satisfies the properties e(a,a) = 0, e(a,b) = e(b,a), and e(a,b) ≤ e(a,c) + e(c,b) (for instance, a = Amsterdam, b = Brussels, and c = Chicago.) We are interested in "similarity metrics". For example, if the objects are classical music pieces then the function D(a,b) = 0 if a and b are by the same composer and D(a,b) = 1 otherwise, is a similarity metric. This metric captures only one similarity aspect (feature) of music pieces, presumably an important one because it subsumes a conglomerate of more elementary features. Density: In defining a class of admissible distances (not necessarily metric distances) we want to exclude unrealistic ones like f(x,y) = 1/2 for every pair x ≠ y. We do this by restricting the number of objects within a given distance of an object. As in [13] we do this by only considering effective distances, as follows.
3.1.1. DEFINITION. Let Ω = Σ*, with Σ a finite nonempty alphabet and Σ* the set of finite strings over that alphabet. Since every finite alphabet can be recoded in binary, we choose Σ = {0,1}. In particular, "files" in computer memory are finite binary strings. A function D : Ω × Ω → ℝ⁺ is an admissible distance if for every pair of objects x,y ∈ Ω the distance D(x,y) satisfies the density condition (a version of the Kraft Inequality (2.4.1)):

∑_y 2^{−D(x,y)} ≤ 1,  (3.1.1)
is computable, and is symmetric, D(x,y) = D(y,x).
If D is an admissible distance, then for every x the set {D(x,y) : y ∈ {0,1}∗} is the length set of a prefix code, since it satisfies (2.4.1), the Kraft inequality. Conversely, if a distance is the length set of a prefix code, then it satisfies (2.4.1), see [34].
3.1.2. EXAMPLE. In representing the Hamming distance d between two strings of equal length n differing in positions i_1,...,i_d, we can use a simple prefix-free encoding of (n,d,i_1,...,i_d) in 2 log n + 4 log log n + 2 + d log n bits. We encode n and d prefix-free in log n + 2 log log n + 1 bits each, see e.g. [80], and then the literal indexes of the actual flipped-bit positions. Adding an O(1)-bit program to interpret these data, with the strings concerned being x and y, we have defined H_n(x,y) = 2 log n + 4 log log n + d log n + O(1) as the length of a prefix code word (prefix program) to compute x from y and vice versa. Then, by the Kraft inequality (Chapter 2, Section 2.4.1), ∑_y 2^{−H_n(x,y)} ≤ 1. It is easy to verify that H_n is a metric in the sense that it satisfies the metric (in)equalities up to O(log n) additive precision.
Normalization: Large objects (in the sense of long strings) that differ by a tiny part are intuitively closer than tiny objects that differ by the same amount. For example, two whole
mitochondrial genomes of 18,000 bases that differ by 9,000 are very different, while two whole nuclear genomes of 3×10^9 bases that differ by only 9,000 bases are very similar. Thus, absolute difference between two objects does not govern similarity, but relative difference appears to do so.
3.1.3. DEFINITION. A compressor is a lossless encoder mapping Ω into {0,1}* such that the resulting code is a prefix code. "Lossless" means that there is a decompressor that reconstructs the source message from the code message. For convenience of notation we identify "compressor" with a "code word length function" C : Ω → ℕ, where ℕ is the set of nonnegative integers. That is, the compressed version of a file x has length C(x). We only consider compressors such that C(x) ≤ |x| + O(log |x|). (The additive logarithmic term is due to our requirement that the compressed file be a prefix code word.) We fix a compressor C, and call the fixed compressor the reference compressor.
3.1.4. DEFINITION. Let D be an admissible distance. Then D+(x) is defined by D+(x) = max{D(x,z) : C(z) ≤ C(x)}, and D+(x,y) is defined by D+(x,y) = max{D+(x),D+(y)}. Note that since D(x,y) = D(y,x), also D+(x,y) = D+(y,x).
3.1.5. DEFINITION. Let D be an admissible distance. The normalized admissible distance, also called a similarity distance, d(x,y), based on D relative to a reference compressor C, is defined by

d(x,y) = D(x,y) / D+(x,y).

It follows from the definitions that a normalized admissible distance is a function d : Ω × Ω → [0,1] that is symmetric: d(x,y) = d(y,x).
3.1.6. LEMMA. For every x ∈ Ω and constant e ∈ [0,1], a normalized admissible distance satisfies the density constraint
|{y : d(x,y) ≤ e, C(y) ≤ C(x)}| < 2^{eD+(x)+1}.  (3.1.2)
PROOF. Assume to the contrary that d does not satisfy (3.1.2). Then, there is an e ∈ [0,1] and an x ∈ Ω, such that (3.1.2) is false. We first note that, since D(x,y) is an admissible distance that satisfies (3.1.1), d(x,y) satisfies a “normalized” version of the Kraft inequality:
∑_{y : C(y) ≤ C(x)} 2^{−d(x,y)D+(x)} ≤ ∑_y 2^{−d(x,y)D+(x,y)} ≤ 1.  (3.1.3)

Starting from (3.1.3) we obtain the required contradiction:
1 ≥ ∑_{y : C(y) ≤ C(x)} 2^{−d(x,y)D+(x)}
  ≥ ∑_{y : d(x,y) ≤ e, C(y) ≤ C(x)} 2^{−eD+(x)}
  ≥ 2^{eD+(x)+1} · 2^{−eD+(x)} > 1.
2
If d(x,y) is the normalized version of an admissible distance D(x,y) then (3.1.3) is equivalent to (3.1.1). We call a normalized distance a “similarity” distance, because it gives a relative simi- larity according to the distance (with distance 0 when objects are maximally similar and distance 1 when they are maximally dissimilar) and, conversely, for every well-defined computable notion of similarity we can express it as a metric distance according to our definition. In the literature a distance that expresses lack of similarity (like ours) is often called a “dissimilarity” distance or a “disparity” distance.
3.1.7. REMARK. As far as the authors know, the idea of a normalized metric is, surprisingly, not well-studied. An exception is [121], which investigates normalized metrics to account for relative distances rather than absolute ones, and it does so for much the same reasons as in the present work. An example there is the normalized Euclidean metric |x − y|/(|x| + |y|), where x,y ∈ ℝⁿ (ℝ denotes the real numbers) and |·| is the Euclidean metric—the L2 norm. Another example is a normalized symmetric-set-difference metric. But these normalized metrics are not necessarily effective in that the distance between two objects gives the length of an effective description to go from either object to the other one.
3.1.8. REMARK. Our definition of normalized admissible distance is more direct than in [78], and the density constraints (3.1.2) and (3.1.3) follow from the definition. In [78] we put a stricter density condition in the definition of "admissible" normalized distance, which is, however, harder to satisfy and perhaps too strict to be realistic. The purpose of this stricter density condition was to obtain a stronger "universality" property than the present Theorem 3.5.3, namely one with α = 1 and ε = O(1/max{C(x),C(y)}). Nonetheless, both definitions coincide if we set the length of the compressed version C(x) of x to the ultimate compressed length K(x), the Kolmogorov complexity of x.
3.1.9. EXAMPLE. To obtain a normalized version of the Hamming distance of Example 3.1.2, we define h_n(x,y) = H_n(x,y)/H_n+(x,y). We can set H_n+(x,y) = H_n+(x) = (n+2)⌈log n⌉ + 4⌈log log n⌉ + O(1), since every contemplated compressor C will satisfy C(x) = C(x̄), where x̄ is x with all bits flipped (so H_n+(x,y) ≥ H_n+(z,z̄) for either z = x or z = y). By (3.1.2), for every x, the number of y with C(y) ≤ C(x) in the Hamming ball h_n(x,y) ≤ e is less than 2^{eH_n+(x)+1}. This upper bound is an obvious overestimate for e ≥ 1/log n. For lower values of e, the upper bound is correct by the observation that the number of y's equals ∑_{i=0}^{en} (n choose i) ≤ 2^{nH(e)}, where H(e) = −e log e − (1−e) log(1−e), Shannon's entropy function. Then, eH_n+(x) > en log n > enH(e), since log n > H(e).
3.2 Normal Compressor
We give axioms determining a large family of compressors that both include most (if not all) real-world compressors and ensure the desired properties of the NCD to be defined later.
3.2.1. DEFINITION. A compressor C is normal if it satisfies, up to an additive O(log n) term, with n the maximal binary length of an element of Ω involved in the (in)equality concerned, the following:
1. Idempotency: C(xx) = C(x), and C(λ) = 0, where λ is the empty string.
2. Monotonicity: C(xy) ≥ C(x).
3. Symmetry: C(xy) = C(yx).
4. Distributivity: C(xy) +C(z) ≤ C(xz) +C(yz).
Idempotency: A reasonable compressor will see exact repetitions and obey idempotency up to the required precision. It will also compress the empty string to the empty string. Monotonicity: A real compressor must have the monotonicity property, at least up to the required precision. The property is evident for stream-based compressors, and only slightly less evident for block-coding compressors. Symmetry: Stream-based compressors of the Lempel-Ziv family, like gzip and pkzip, and the predictive PPM family, like PPMZ, are possibly not precisely symmetric. This is related to the stream-based property: the initial file x may have regularities to which the compressor adapts; after crossing the border to y it must unlearn those regularities and adapt to the ones of y. This process may cause some imprecision in symmetry that vanishes asymptotically with the length of x,y. A compressor must be poor indeed (and will certainly not be used to any extent) if it doesn't satisfy symmetry up to the required precision. Apart from stream-based compressors, the other major family of compressors is block-coding based, like bzip2. These essentially analyze the full input block by considering all rotations in obtaining the compressed version. Such a compressor is to a great extent symmetric, and real experiments show no departure from symmetry. Distributivity: The distributivity property is not immediately intuitive. In Kolmogorov complexity theory the stronger distributivity property
C(xyz) + C(z) ≤ C(xz) + C(yz)  (3.2.1)
holds (with K = C). However, to prove the desired properties of NCD below, only the weaker distributivity property

C(xy) + C(z) ≤ C(xz) + C(yz)  (3.2.2)

above is required, also for the boundary case where C = K. In practice, real-world compressors appear to satisfy this weaker distributivity property up to the required precision.
3.2.2. DEFINITION. Define

C(y|x) = C(xy) − C(x).  (3.2.3)

This number C(y|x) of bits of information in y, relative to x, can be viewed as the excess number of bits in the compressed version of xy compared to the compressed version of x, and is called the amount of conditional compressed information.
In the definition of a compressor the decompression algorithm is not included (unlike the case of Kolmogorov complexity, where the decompressing algorithm is given by definition), but it is easy to construct one: Given the compressed version of x in C(x) bits, we can run the compressor on all candidate strings z—for example, in length-increasing lexicographical order—until we find a string z₀ whose compressed version equals the given compressed version of x. Since compression is lossless, we have found x = z₀. Given the compressed version of xy in C(xy) bits, we repeat this process using strings xz until we find the string xz₁ of which the compressed version equals the compressed version of xy. Since the former compressed version decompresses to xy, we have found y = z₁. By the unique decompression property we find that C(y|x) is the extra number of bits we require to describe y apart from describing x. It is intuitively acceptable that the conditional compressed information C(x|y) satisfies the triangle inequality
C(x|y) ≤ C(x|z) + C(z|y).  (3.2.4)
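The search-based decompressor described in the previous paragraph is trivially expressible in code (a sketch of ours; it is astronomically slow in general, but it shows that the compressor alone determines decompression):

    from itertools import count, product
    import zlib

    def decompress(compressed, compressor):
        # Try candidate strings in length-increasing lexicographical order.
        for n in count(0):
            for candidate in product(range(256), repeat=n):
                z = bytes(candidate)
                if compressor(z) == compressed:
                    return z    # lossless compression guarantees z is the source

    print(decompress(zlib.compress(b'ab'), zlib.compress))   # b'ab'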
3.2.3. LEMMA. Both (3.2.1) and (3.2.4) imply (3.2.2).
PROOF. ((3.2.1) implies (3.2.2):) By monotonicity. ((3.2.4) implies (3.2.2):) Rewrite the terms in (3.2.4) according to (3.2.3), cancel C(y) in the left- and right-hand sides, use symmetry, and rearrange. 2
3.2.4. LEMMA. A normal compressor additionally satisfies subadditivity: C(xy) ≤ C(x) + C(y).
PROOF. Consider the special case of distributivity with z the empty word so that xz = x, yz = y, and C(z) = 0. 2
Subadditivity: The subadditivity property is clearly also required for every viable compressor, since a compressor may use information acquired from x to compress y. Minor imprecision may arise from the unlearning effect of crossing the border between x and y, mentioned in relation to symmetry, but again this must vanish asymptotically with increasing length of x,y.
3.3 Background in Kolmogorov complexity
Technically, the Kolmogorov complexity of x given y is the length of the shortest binary program, for the reference universal prefix Turing machine, that on input y outputs x; it is denoted as K(x|y). For precise definitions, theory and applications, see [80]. The Kolmogorov complexity of x is the length of the shortest binary program with no input that outputs x; it is denoted as K(x) = K(x|λ), where λ denotes the empty input. Essentially, the Kolmogorov complexity of a file is the length of the ultimate compressed version of the file. In [13] the information distance E(x,y) was introduced, defined as the length of the shortest binary program for the reference universal prefix Turing machine that, with input x computes y, and with input y computes x. It was shown there that, up to an additive logarithmic term, E(x,y) = max{K(x|y),K(y|x)}. It was shown also that E(x,y) is a metric, up to negligible violations of the metric inequalities. Moreover, it is universal
in the sense that for every admissible distance D(x,y) as in Definition 3.1.1, E(x,y) ≤ D(x,y) up to an additive constant depending on D but not on x and y. In [78], the normalized version of E(x,y), called the normalized information distance, is defined as

NID(x,y) = max{K(x|y),K(y|x)} / max{K(x),K(y)}.  (3.3.1)

It too is a metric, and it is universal in the sense that this single metric minorizes, up to a negligible additive error term, all normalized admissible distances in the class considered in [78]. Thus, if two files (of whatever type) are similar (that is, close) according to the particular feature described by a particular normalized admissible distance (not necessarily metric), then they are also similar (that is, close) in the sense of the normalized information metric. This justifies calling the latter the similarity metric. We stress once more that different pairs of objects may have different dominating features. Yet every such dominant similarity is detected by the NID. However, this metric is based on the notion of Kolmogorov complexity. Unfortunately, the Kolmogorov complexity is non-computable in the Turing sense. Approximation of the denominator of (3.3.1) by a given compressor C is straightforward: it is max{C(x),C(y)}. The numerator is trickier. It can be rewritten as

max{K(x,y) − K(x), K(x,y) − K(y)},  (3.3.2)

within logarithmic additive precision, by the additive property of Kolmogorov complexity [80]. The term K(x,y) represents the length of the shortest program for the pair (x,y). In compression practice it is easier to deal with the concatenation xy or yx. Again, within logarithmic precision K(x,y) = K(xy) = K(yx). Following a suggestion by Steven de Rooij, one can approximate (3.3.2) best by min{C(xy),C(yx)} − min{C(x),C(y)}. Here, and in the later experiments using the CompLearn Toolkit [?], we simply use C(xy) rather than min{C(xy),C(yx)}. This is justified by the observation that block-coding based compressors are symmetric almost by definition, and experiments with various stream-based compressors (gzip, PPMZ) show only small deviations from symmetry. The result of approximating the NID using a real compressor C is called the normalized compression distance (NCD), formally defined in (3.5.1). The theory as developed for the Kolmogorov-complexity based NID in [78] may not hold for the (possibly poorly) approximating NCD. It is nonetheless the case that experiments show that the NCD apparently has (some) properties that make the NID so appealing. To fill this gap between theory and practice, we develop the theory of NCD from first principles, based on the axiomatics of Section 3.2. We show that the NCD is a quasi-universal similarity metric relative to a normal reference compressor C. The theory developed in [78] is the boundary case C = K, where the "quasi-universality" below has become full "universality".
3.4 Compression Distance
We define a compression distance based on a normal compressor and show it is an admissible distance. In applying the approach, we have to make do with an approximation based on a far
less powerful real-world reference compressor C. A compressor C approximates the information distance E(x,y), based on Kolmogorov complexity, by the compression distance E_C(x,y) defined as

E_C(x,y) = C(xy) − min{C(x),C(y)}.  (3.4.1)

Here, C(xy) denotes the compressed size of the concatenation of x and y, C(x) denotes the compressed size of x, and C(y) denotes the compressed size of y.
3.4.1. LEMMA. If C is a normal compressor, then E_C(x,y) + O(1) is an admissible distance.
PROOF. Case 1: Assume C(x) ≤ C(y). Then E_C(x,y) = C(xy) − C(x). Then, given x and a prefix-program of length E_C(x,y) consisting of the suffix of the C-compressed version of xy, and the compressor C in O(1) bits, we can run the compressor C on all xz's, for candidate strings z in length-increasing lexicographical order. When we find a z so that the suffix of the compressed version of xz matches the given suffix, then z = y by the unique decompression property. Case 2: Assume C(y) ≤ C(x). By symmetry C(xy) = C(yx). Now follow the proof of Case 1. 2
3.4.2. LEMMA. If C is a normal compressor, then E_C(x,y) satisfies the metric (in)equalities up to logarithmic additive precision.
PROOF. Only the triangular inequality is non-obvious. By (3.2.2) C(xy) +C(z) ≤ C(xz) +C(yz) up to logarithmic additive precision. There are six possibilities, and we verify the correctness of the triangular inequality in turn for each of them. Assume C(x) ≤ C(y) ≤ C(z): Then C(xy) − C(x) ≤ C(xz)−C(x)+C(yz)−C(y). Assume C(y) ≤ C(x) ≤ C(z): Then C(xy)−C(y) ≤ C(xz)− C(y)+C(yz)−C(x). Assume C(x) ≤ C(z) ≤ C(y): Then C(xy)−C(x) ≤ C(xz)−C(x)+C(yz)− C(z). Assume C(y) ≤ C(z) ≤ C(x): Then C(xy) −C(y) ≤ C(xz) −C(z) +C(yz) −C(y). Assume C(z) ≤ C(x) ≤ C(y): Then C(xy) −C(x) ≤ C(xz) −C(z) +C(yz) −C(z). Assume C(z) ≤ C(y) ≤ C(x): Then C(xy) −C(y) ≤ C(xz) −C(z) +C(yz) −C(z). 2
3.4.3. LEMMA. If C is a normal compressor, then E_C+(x,y) = max{C(x),C(y)}.
PROOF. Consider a pair (x,y). The max{C(xz) −C(z) : C(z) ≤ C(y)} is C(x) which is achieved for z = λ, the empty word, with C(λ) = 0. Similarly, the max{C(yz) −C(z) : C(z) ≤ C(x)} is C(y). Hence the lemma. 2
3.5 Normalized Compression Distance
The normalized version of the admissible distance E_C(x,y), the compressor-C-based approximation of the normalized information distance (3.3.1), is called the normalized compression distance or NCD:

NCD(x,y) = (C(xy) − min{C(x),C(y)}) / max{C(x),C(y)}.  (3.5.1)
This NCD is the main concept of this work. It is the real-world version of the ideal notion of normalized information distance NID in (3.3.1).
3.5.1. REMARK. In practice, the NCD is a non-negative number 0 ≤ r ≤ 1 + ε representing how different the two files are. Smaller numbers represent more similar files. The ε in the upper bound is due to imperfections in our compression techniques, but for most standard compression algorithms one is unlikely to see an ε above 0.1 (in our experiments gzip and bzip2 achieved NCDs above 1, but PPMZ always had NCD at most 1).
There is a natural interpretation to NCD(x,y): if, say, C(y) ≥ C(x), then we can rewrite

NCD(x,y) = (C(xy) − C(x)) / C(y).

That is, the distance NCD(x,y) between x and y is the improvement due to compressing y using x as a previously compressed "data base," compared to compressing y from scratch, expressed as the ratio between the bit-wise lengths of the two compressed versions. Relative to the reference compressor we can define the information in x about y as C(y) − C(y|x). Then, using (3.2.3),

NCD(x,y) = 1 − (C(y) − C(y|x)) / C(y).

That is, the NCD between x and y is 1 minus the ratio of the information in x about y and the information in y.
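Formula (3.5.1) translates directly into code (a sketch of ours; bz2 merely stands in for the normal reference compressor C, and this is not the thesis's CompLearn setup):

    import bz2

    def C(data):
        return len(bz2.compress(data))    # compressed length, here in bytes

    def ncd(x, y):
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b'the quick brown fox jumps over the lazy dog' * 20
    b = b'the quick brown fox leaps over the lazy cat' * 20
    print(ncd(a, a))   # near 0: a file is maximally similar to itself
    print(ncd(a, b))   # small: the two files share most of their structure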
3.5.2. THEOREM. If the compressor is normal, then the NCD is a normalized admissible distance satisfying the metric (in)equalities, that is, a similarity metric.
PROOF. If the compressor is normal, then by Lemma 3.4.1 and Lemma 3.4.3, the NCD is a normalized admissible distance. It remains to show it satisfies the three metric (in)equalities. 1. By idempotency we have NCD(x,x) = 0. By monotonicity we have NCD(x,y) ≥ 0 for every x,y, with inequality for y ≠ x. 2. NCD(x,y) = NCD(y,x). The NCD is unchanged by interchanging x and y in (3.5.1). 3. The difficult property is the triangle inequality. Without loss of generality we assume C(x) ≤ C(y) ≤ C(z). Since the NCD is symmetrical, there are only three triangle inequalities that can be expressed in terms of NCD(x,y), NCD(x,z), NCD(y,z). We verify them in turn: (a) NCD(x,y) ≤ NCD(x,z) + NCD(z,y): By distributivity, the compressor itself satisfies C(xy) + C(z) ≤ C(xz) + C(zy). Subtracting C(x) from both sides and rewriting, C(xy) − C(x) ≤ C(xz) − C(x) + C(zy) − C(z). Dividing by C(y) on both sides we find

(C(xy) − C(x)) / C(y) ≤ (C(xz) − C(x) + C(zy) − C(z)) / C(y).

The left-hand side is ≤ 1.
i. Assume the right-hand side is ≤ 1. Setting C(z) = C(y) + ∆, and adding ∆ to both the numerator and denominator of the right-hand side, it can only increase and draw closer to 1. Therefore,

(C(xy) − C(x)) / C(y) ≤ (C(xz) − C(x) + C(zy) − C(z) + ∆) / (C(y) + ∆)
 = (C(zx) − C(x)) / C(z) + (C(zy) − C(y)) / C(z),

which was what we had to prove. ii. Assume the right-hand side is > 1. We proceed like in the previous case, and add ∆ to both numerator and denominator. Although now the right-hand side decreases, it must still be greater than 1, and therefore the right-hand side remains at least as large as the left-hand side. (b) NCD(x,z) ≤ NCD(x,y) + NCD(y,z): By distributivity we have C(xz) + C(y) ≤ C(xy) + C(yz). Subtracting C(x) from both sides, rearranging, and dividing both sides by C(z) we obtain

(C(xz) − C(x)) / C(z) ≤ (C(xy) − C(x)) / C(z) + (C(yz) − C(y)) / C(z).

The right-hand side doesn't decrease when we substitute C(y) for the denominator C(z) of the first term, since C(y) ≤ C(z). Therefore, the inequality stays valid under this substitution, which was what we had to prove. (c) NCD(y,z) ≤ NCD(y,x) + NCD(x,z): By distributivity we have C(yz) + C(x) ≤ C(yx) + C(xz). Subtracting C(y) from both sides, using symmetry, rearranging, and dividing both sides by C(z) we obtain

(C(yz) − C(y)) / C(z) ≤ (C(xy) − C(x)) / C(z) + (C(xz) − C(x)) / C(z).

The right-hand side doesn't decrease when we substitute C(y) for the denominator C(z) of the first term, since C(y) ≤ C(z). Therefore, the inequality stays valid under this substitution, which was what we had to prove. 2
Quasi-Universality: We now digress to the theory developed in [78], which formed the motivation for developing the NCD. If, instead of the result of some real compressor, we substitute the Kolmogorov complexity for the lengths of the compressed files in the NCD formula, the result is the NID as in (3.3.1). It is universal in the following sense: Every admissible distance expressing similarity according to some feature, that can be computed from the objects concerned, is comprised (in the sense of minorized) by the NID. Note that every feature of the data
gives rise to a similarity, and, conversely, every similarity can be thought of as expressing some feature: being similar in that sense. Our actual practice in using the NCD falls short of this ideal theory in at least three respects: (i) The claimed universality of the NID holds only for indefinitely long sequences x,y. Once we consider strings x,y of definite length n, it is only universal with respect to "simple" computable normalized admissible distances, where "simple" means that they are computable by programs of length, say, logarithmic in n. This reflects the fact that, technically speaking, the universality is achieved by summing the weighted contribution of all similarity distances in the class considered with respect to the objects considered. Only similarity distances of which the complexity is small (which means that the weight is large), with respect to the size of the data concerned, kick in. (ii) The Kolmogorov complexity is not computable, and it is in principle impossible to compute how far off the NCD is from the NID. So we cannot in general know how well we are doing using the NCD. (iii) To approximate the NCD we use standard compression programs like gzip, PPMZ, and bzip2. While better compression of a string will always approximate the Kolmogorov complexity better, this may not be true for the NCD. Due to its arithmetic form, with subtraction and division, it is theoretically possible that while all items in the formula get better compressed, the improvement is not the same for all items, and the NCD value moves away from the NID value. In our experiments we have not observed this behavior in a noticeable fashion. Formally, we can state the following:
3.5.3. THEOREM. Let d be a computable normalized admissible distance and C be a normal compressor. Then, NCD(x,y) ≤ αd(x,y) + ε, where for C(x) ≥ C(y), we have α = D+(x)/C(x) and ε = (C(x|y) − K(x|y))/C(x), with C(x|y) according to (3.2.3).
PROOF. Fix d, C, x, y in the statement of the theorem. Since the NCD is symmetrical, we can, without loss of generality, let C(x) ≥ C(y). By (3.2.3) and the symmetry property C(xy) = C(yx) we have C(x|y) ≥ C(y|x). Therefore, NCD(x,y) = C(x|y)/C(x). Let d(x,y) be the normalized version of the admissible distance D(x,y); that is, d(x,y) = D(x,y)/D+(x,y). Let d(x,y) = e. By (3.1.2), there are < 2^{eD+(x)+1} many pairs (x,v) such that d(x,v) ≤ e and C(v) ≤ C(x). Since d is computable, we can compute and enumerate all these pairs. The initially fixed pair (x,y) is an element in the list and its index takes ≤ eD+(x) + 1 bits. Therefore, given x, the y can be described by at most eD+(x) + O(1) bits—its index in the list and an O(1) term accounting for the lengths of the programs involved in reconstructing y given its index in the list, and algorithms to compute the functions d and C. Since the Kolmogorov complexity gives the length of the shortest effective description, we have K(y|x) ≤ eD+(x) + O(1). Substitution, rewriting, and using K(x|y) ≤ E(x,y) ≤ D(x,y) up to ignorable additive terms (Section 3.3), yields NCD(x,y) = C(x|y)/C(x) ≤ αe + ε, which was what we had to prove. 2
3.5.4. REMARK. Clustering according to NCD will group sequences together that are similar according to features that are not explicitly known to us. Analysis of what the compressor actually
does still may not tell us which features that make sense to us can be expressed by conglomerates of features analyzed by the compressor. This can be exploited to track down unknown features implicitly in classification: automatically forming clusters of data and seeing in which cluster (if any) a new candidate is placed. Another aspect that can be exploited is exploratory: Given that the NCD is small for a pair x,y of specific sequences, what does this really say about the sense in which these two sequences are similar? The above analysis suggests that close similarity will be due to a dominating feature (that perhaps expresses a conglomerate of subfeatures). Looking into these deeper causes may give feedback about the appropriateness of the realized NCD distances and may help extract more intrinsic information about the objects than the oblivious division into clusters, by looking for the common features in the data clusters.
3.6 Kullback-Leibler divergence and NCD
NCD is sometimes considered a mysterious and obscure measure of information distance. In fact, as we explain in this section, in some cases it can be thought of as a generalization and extension of older and well-established methods. The Normalized Information Distance is a purely theoretical concept and cannot be exactly computed for even the simplest files due to the inherent incomputability of Kolmogorov Complexity. The Normalized Compression Distance, however, replaces the uncomputable K with an approximation based on a particular data compressor. Different data compression algorithms lead to different varieties of NCD. Modern data compression programs use highly evolved and complicated schemes that involve stochastic and adaptive modelling of the data at many levels simultaneously. These are all but impossible to analyze from a precise mathematical viewpoint, and thus many consider modern data compression as much an art as a science. If, however, we look instead at the compressors popular in UNIX in the 1970's, we can begin to understand how NCD achieves its results. As we show in this section, it turns out that with such simple compressors, the NCD calculates the total KL-divergence to the mean. Below we first (Section 3.6.1) connect such compressors to entropy, and then (Section 3.6.2) relate them to KL-divergence.
3.6.1 Static Encoders and Entropy

The UNIX System V pack command uses a static (non-adaptive) Huffman coding scheme to compress files. The method works in two passes. First, the input file is considered as a sequence of 8-bit bytes, and a histogram is constructed to represent the frequency of each byte. Next, an optimal Huffman code is constructed according to this histogram, and is represented in memory as a Huffman tree. This Huffman tree is written out as a variable-length header in the compressed file. Finally, the algorithm makes a second pass through the file and encodes it using the Huffman code it has constructed. Notice that the coding scheme does not change throughout the duration of the file. It is this failure to adapt that makes this compressor amenable to mathematical analysis. In the following example, we analyze a hypothetical static arithmetic coder, which yields simpler codelength equations. The simpler Huffman pack encoder will perform similarly but
must round the codelength of each symbol up to a whole number of bits and thus can lose at most 1 bit per symbol as compared to the arithmetic coder described below. Consider therefore the particular case of a simple static arithmetic coder S. Let S(D) represent the function mapping a file D to the number of bits needed to encode D with S. A static arithmetic encoder really models its input file as an i.i.d. (independently, identically distributed) Bernoulli process. For distributions of this type, the code length that is achieved very closely approximates the empirical Shannon entropy of the file [93, 51] multiplied by the file size N_D. Thus, if data are indeed distributed according to a Bernoulli process, then this encoder almost achieves the theoretically ideal Shannon limit. Let us explain this in more detail. Let D be a file over an alphabet Σ. Let n(D,i) denote the number of occurrences of symbol i in file D, and let N_D denote the total number of symbols in file D. Then
$$ N_D = \sum_{i \in \Sigma} n(D,i). \qquad (3.6.1) $$
The empirical distribution corresponding to file D is defined as the probability distribution that assigns to each symbol in the alphabet the probability
$$ P_D(i) = \frac{n(D,i)}{N_D}. \qquad (3.6.2) $$
The empirical distribution P_D is just the histogram of relative frequencies of the symbols of Σ occurring in D. It turns out that, when provided the empirical distribution P_D, the theoretical arithmetic coder S requires just this many bits:
$$ Z(D) = N_D H(P_D), \qquad (3.6.3) $$
with H representing Shannon entropy:
$$ H(P_D) = \sum_{i \in \Sigma} -P_D(i) \log P_D(i). \qquad (3.6.4) $$
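For concreteness, the quantities in (3.6.1)–(3.6.4) are easy to compute directly. The following minimal Python sketch (an illustration, not part of any particular compressor implementation) computes the empirical distribution P_D, its entropy H(P_D), and the idealized codelength Z(D) = N_D H(P_D) for a file given as a byte string.

    import math
    from collections import Counter

    def empirical_distribution(data: bytes) -> dict:
        """P_D of (3.6.2): relative frequency of each symbol in D."""
        counts = Counter(data)
        n = len(data)
        return {sym: c / n for sym, c in counts.items()}

    def entropy(p: dict) -> float:
        """Shannon entropy H(P_D) of (3.6.4), in bits per symbol."""
        return -sum(q * math.log2(q) for q in p.values() if q > 0)

    def ideal_codelength(data: bytes) -> float:
        """Z(D) = N_D * H(P_D) of (3.6.3): bits used by the theoretical
        static arithmetic coder, ignoring the header."""
        return len(data) * entropy(empirical_distribution(data))

    # Example: a heavily biased file needs far fewer than 8 bits/symbol.
    D = b"aaaaabbbc" * 1000
    print(ideal_codelength(D), len(D) * 8)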
For a real static arithmetic coding implementation S, there is a need to transmit a small fixed-size header as a preamble to the arithmetic coding. This header provides an encoding of the histogram P_D corresponding to the file D to be encoded. The length of this header, in bits, is termed ‖header‖. So:
$$ S(D) = N_D H(P_D) + \|\text{header}\|. \qquad (3.6.5) $$
To be fully precise, in a real implementation, the number of bits needed is always an integer, so (3.6.5) really needs to be rounded up; but the effect of this change is negligible, so we will ignore it.
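The rounding loss of a Huffman-based scheme such as pack, relative to this idealized codelength, can also be checked directly. The sketch below is a hypothetical illustration rather than the System V pack implementation: it builds a static Huffman code from the byte histogram (assuming non-empty input) and compares the resulting total codelength with N_D H(P_D), exhibiting a loss below 1 bit per symbol.

    import heapq, math
    from collections import Counter

    def huffman_bits(data: bytes) -> int:
        """Total payload bits for a static Huffman code built from the
        byte histogram of `data` (header not counted; data non-empty)."""
        freq = Counter(data)
        if len(freq) == 1:              # degenerate one-symbol alphabet
            return len(data)            # one bit per symbol
        # Heap of (subtree frequency, tiebreak, {symbol: depth}).
        heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tick = len(heap)
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)
            fb, _, b = heapq.heappop(heap)
            merged = {s: d + 1 for s, d in {**a, **b}.items()}
            heapq.heappush(heap, (fa + fb, tick, merged))
            tick += 1
        depths = heap[0][2]             # codeword length per symbol
        return sum(freq[s] * depths[s] for s in freq)

    D = b"aaaaabbbc" * 1000
    H = -sum((c / len(D)) * math.log2(c / len(D)) for c in Counter(D).values())
    print(huffman_bits(D), len(D) * H)  # Huffman loses < 1 bit per symbol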
3.6.1. REMARK. Let us give a little bit more explanation of (3.6.3). From the Kraft inequality (Section 2.4.1), we know that for any distribution P on strings D ∈ Σ^N of length N, there exists a compressor Z such that for all D ∈ Σ^N, Z(D) = −log P(D), where again we ignore rounding issues. Now let us model D according to a Bernoulli process, where each element of D is distributed independently according to the empirical distribution P_D. Under this distribution, setting D = x_1 ... x_N,
$$ Z(D) = -\log P(D) = -\log \prod_{j=1}^{N} P_D(x_j) $$
$$ = -\log \prod_{i \in \Sigma} P_D(i)^{n(D,i)} = -N \sum_{i \in \Sigma} \frac{n(D,i)}{N} \log P_D(i) \qquad (3.6.6) $$
$$ = -N \sum_{i \in \Sigma} P_D(i) \log P_D(i) = N \, \mathbf{E}_{P_D}[-\log P_D(X)] \qquad (3.6.7) $$
$$ = N H(P_D). \qquad (3.6.8) $$
Such a compressor Z makes use of the empirical distribution P_D, so the encoding of D with length Z(D) can only be decoded by a decoder who already knows P_D. Thus, to turn Z into a compressor S that can be used on all sequences (and not only those with a given, fixed P_D), it suffices to first encode P_D using some previously agreed-upon code, which takes ‖header‖ bits, and then encode D using Z(D) bits. By (3.6.8) this is equal to (3.6.5).
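As an illustration of such a two-part code, the following sketch computes S(D) as in (3.6.5) under an assumed, purely hypothetical header format of one 32-bit count plus one 8-bit symbol name per distinct byte; real implementations use more economical headers.

    import math
    from collections import Counter

    # Hypothetical header format: one 32-bit count plus one 8-bit symbol
    # name per distinct byte value. Real coders use cleverer headers.
    def header_bits(data: bytes) -> int:
        return len(Counter(data)) * (32 + 8)

    def s_codelength(data: bytes) -> float:
        """S(D) = ||header|| + N_D * H(P_D), as in (3.6.5)."""
        counts = Counter(data)
        n = len(data)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        return header_bits(data) + n * h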
3.6.2 NCD and KL-divergence

We now connect the NCD based on a static arithmetic encoder with the KL-divergence [34]. Consider two files F and G with empirical distributions P_F and P_G:
$$ P_F(i) = \frac{n(F,i)}{N_F}; \qquad P_G(i) = \frac{n(G,i)}{N_G}. \qquad (3.6.9) $$
There is a file B that is the concatenation of F followed by G and has empirical distribution
$$ P_B(i) = \frac{n(F,i) + n(G,i)}{N_F + N_G}. \qquad (3.6.10) $$
The size for S run on B is just the size of the histogram header, ‖header‖, plus the entropy of P_B times the number of symbols:
$$ S(B) = \|\text{header}\| + (N_F + N_G) H(P_B) $$
$$ = \|\text{header}\| + (N_F + N_G) \sum_{i \in \Sigma} -P_B(i) \log P_B(i) $$
$$ = \|\text{header}\| - (N_F + N_G) \sum_{i \in \Sigma} \frac{n(F,i) + n(G,i)}{N_F + N_G} \log \frac{n(F,i) + n(G,i)}{N_F + N_G} $$
$$ = \|\text{header}\| - \sum_{i \in \Sigma} (n(F,i) + n(G,i)) \log \frac{n(F,i) + n(G,i)}{N_F + N_G}. \qquad (3.6.11) $$
Recall that the Kullback-Leibler divergence [34] between two distributions P, Q is defined as
$$ KL(P \| Q) = \sum_{i \in \Sigma} P(i) \log \frac{P(i)}{Q(i)}, \qquad (3.6.12) $$
so that
$$ S(B) = \|\text{header}\| + N_F H(P_F) + N_G H(P_G) + N_F \, KL(P_F \| P_B) + N_G \, KL(P_G \| P_B). \qquad (3.6.13) $$
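To see why (3.6.13) follows from (3.6.11), note that N_F H(P_F) + N_F KL(P_F ‖ P_B) = −N_F ∑_i P_F(i) log P_B(i), and similarly for G; summing the two gives −∑_i (n(F,i) + n(G,i)) log P_B(i) = (N_F + N_G) H(P_B). The identity can also be checked numerically, as in the following sketch (with arbitrary example files):

    import math
    from collections import Counter

    def dist(data: bytes) -> dict:
        n = len(data)
        return {s: c / n for s, c in Counter(data).items()}

    def H(p):      # Shannon entropy in bits
        return -sum(q * math.log2(q) for q in p.values())

    def KL(p, q):  # KL-divergence; assumes support(p) is within support(q)
        return sum(pi * math.log2(pi / q[i]) for i, pi in p.items())

    F, G = b"aabac" * 400, b"bbbcc" * 600
    PF, PG, PB = dist(F), dist(G), dist(F + G)
    NF, NG = len(F), len(G)

    lhs = (NF + NG) * H(PB)
    rhs = NF * H(PF) + NG * H(PG) + NF * KL(PF, PB) + NG * KL(PG, PB)
    print(abs(lhs - rhs) < 1e-6)  # the decomposition (3.6.13) holds exactly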
At this point we determine a formula for NCD_S. Recall that the NCD is defined in (3.5.1) as a function to determine an information distance between two input files:
$$ NCD_C(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}. $$
Here, C(xy) denotes the compressed size of the concatenation of x and y, C(x) denotes the compressed size of x, and C(y) denotes the compressed size of y. C is a function returning a length, usually measured in bits, and realized by a particular data compression program. Different algorithms will yield different behaviors for this quotient. For our simple compressor S, we get
$$ NCD_S(F,G) = \frac{S(B) - \min\{S(F), S(G)\}}{\max\{S(F), S(G)\}}. \qquad (3.6.14) $$
Assume without loss of generality that S(F) ≤ S(G); then
$$ NCD_S(F,G) = \frac{\|\text{header}\| + N_F H(P_F) + N_G H(P_G) + N_F \, KL(P_F \| P_B) + N_G \, KL(P_G \| P_B) - \|\text{header}\| - N_F H(P_F)}{\|\text{header}\| + N_G H(P_G)} $$
$$ = \frac{N_G H(P_G) + N_F \, KL(P_F \| P_B) + N_G \, KL(P_G \| P_B)}{\|\text{header}\| + N_G H(P_G)}. \qquad (3.6.15) $$
In the limit, as N_F, N_G → ∞,
$$ \lim_{N_F, N_G \to \infty} NCD_S(F,G) = 1 + \frac{N_F \, KL(P_F \| P_B) + N_G \, KL(P_G \| P_B)}{N_G H(P_G)} $$
$$ = 1 + \frac{\frac{N_F}{N_G} KL(P_F \| P_B) + KL(P_G \| P_B)}{H(P_G)}. \qquad (3.6.16) $$
Notice that 0 ≤ N_F/N_G ≤ 1. It is apparent that P_B represents an aggregate distribution formed by combining P_F and P_G. When N_F = N_G,
$$ \lim_{N_F, N_G \to \infty} NCD_S(F,G) = 1 + \frac{KL(P_F \| P_B) + KL(P_G \| P_B)}{\max\{H(P_F), H(P_G)\}}, \qquad (3.6.17) $$
or, using A(P_F, P_G) to represent the information radius [54] or total KL-divergence to the mean [35],
$$ \lim_{N_F, N_G \to \infty} NCD_S(F,G) = 1 + \frac{A(P_F, P_G)}{\max\{H(P_F), H(P_G)\}}. \qquad (3.6.18) $$
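As a sketch, the exact NCD_S of (3.6.14) and the limit (3.6.18) can be compared directly for equal-length files; the fixed header size of 300 bits below is an arbitrary assumption for illustration, not a measured value.

    import math
    from collections import Counter

    def dist(d):  return {s: c / len(d) for s, c in Counter(d).items()}
    def H(p):     return -sum(q * math.log2(q) for q in p.values())
    def KL(p, q): return sum(pi * math.log2(pi / q[i]) for i, pi in p.items())

    HEADER = 300.0  # assumed fixed header size in bits (an illustration)

    def S(d):        # static arithmetic codelength, as in (3.6.5)
        return HEADER + len(d) * H(dist(d))

    def ncd_s(f, g): # NCD with the simple compressor S, as in (3.6.14)
        sf, sg, sb = S(f), S(g), S(f + g)
        return (sb - min(sf, sg)) / max(sf, sg)

    def ncd_limit(f, g):  # the limit (3.6.18), for equal-length files
        pf, pg, pb = dist(f), dist(g), dist(f + g)
        irad = KL(pf, pb) + KL(pg, pb)
        return 1 + irad / max(H(pf), H(pg))

    F = bytes(1 if i % 100 < 55 else 0 for i in range(200000))  # 55% ones
    G = bytes(1 if i % 100 < 70 else 0 for i in range(200000))  # 70% ones
    print(ncd_s(F, G), ncd_limit(F, G))  # nearly equal for long files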
We may interpret NCD_S as behaving like a ratio of information radius to maximum individual entropy. The static arithmetic coder S severely violates the subadditivity assumption of a normal compressor (Section 3.2) and causes a positive offset bias of +1. In general, NCD_S behaves locally quadratically when at least one of the two files involved has high entropy. This fact can be
demonstrated by use of Taylor series to approximate the logarithms in NCD_S about P_B (we omit the details). When both H(P_F) and H(P_G) are small, NCD_S can grow hyperbolically without bound. Let us now turn to a new compressor, T. T is a first-order static arithmetic coder. It maintains a table of |Σ| separate contexts, and corresponds to modelling the data as a first-order Markov chain. In fact, it can be shown that the correspondence between NCD, KL, and H continues for any finite-order Markov chain (we omit the details). We have done an experiment to verify these relations. In this experiment, we create files of exact empirical distributions. The alphabet is Σ = {0,1}. The "fixed" file F is set to a θ = 0.55 Bernoulli binary distribution, i.e. F consists of 55% 1s. The other file G is allowed to vary over 0.30 ≤ θ ≤ 0.80. We have used Michael Schindler's range encoder, R, as a fast and simple arithmetic coder. The results are in Figure 3.1. The graph's horizontal axis represents the empirical bias in file G. The ‖header‖ addend is necessary to allow the empirical discrete finite probability distribution of the input file to be encoded, so that during decompression the arithmetic decoder statistics can accurately represent those of the original encoding. There is good agreement between the NCD_R and the prediction based on information radius and maximum entropy. In these experiments, 200,000-symbol files were used for F and G. The deviation between the NCD_R and the information radius (both shown in the graph, with nearly overlapping curves) is on the order of 0.001 bit, and this can be attributed to imperfections in compression, header length, etc.
[Figure: NCD_R and the prediction 1 + IRad/max(H), plotted as distance (0.98–1.12) against the percent bias (30–80%) of the binary Bernoulli distribution; the two curves nearly overlap.]
Figure 3.1: A comparison of predicted and observed values for NCD_R.
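Lacking the range encoder R here, the sweep underlying Figure 3.1 can be approximated with the exact codelength model S(D) = ‖header‖ + N H(P_D) in its place, which the range coder closely approximates; the header size of 300 bits is again an assumption for illustration. The sketch below reproduces the two nearly overlapping curves:

    import math

    def H2(p):     # binary entropy in bits
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def KL2(p, q): # binary KL-divergence in bits
        return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

    N, HEADER, tf = 200000, 300.0, 0.55    # header size is an assumption

    def S_bits(theta):                      # S(D) = ||header|| + N*H(P_D)
        return HEADER + N * H2(theta)

    def S_concat(tf, tg):                   # two equal-length files: P_B is the average
        return HEADER + 2 * N * H2((tf + tg) / 2)

    for tg in [x / 100 for x in range(30, 81, 5)]:
        sf, sg, sb = S_bits(tf), S_bits(tg), S_concat(tf, tg)
        ncd = (sb - min(sf, sg)) / max(sf, sg)
        tb = (tf + tg) / 2
        pred = 1 + (KL2(tf, tb) + KL2(tg, tb)) / max(H2(tf), H2(tg))
        print(f"{tg:.2f}  NCD_S={ncd:.4f}  predicted={pred:.4f}")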
3.6.2. REMARK. It seems clear that many simple compressors yield simple closed-form formulas for specific variants of NCD. It is not clear whether such a close correspondence between the NCD and KL-divergence (or other simple analytic quantities) still holds in realistic situations, where a sophisticated compressor (such as gzip, ppm, or paq) is used on real-world data. The Shannon entropy and KL-divergence are expected codelengths, i.e. theoretical averages taken with respect to some hypothesized distribution. The NCD is based on actual, individual sequence codelengths, obtained with some compressor Z. By the Kraft inequality (Chapter 2), Z must correspond to some distribution P such that for all data sequences D of given length,
$$ Z(D) = -\log P(D). $$
In the case of the static arithmetic encoder, it turned out that this expression could be rewritten as
$$ Z(D) = \|\text{header}\| - \log P_D(D) = \|\text{header}\| + N_D \, \mathbf{E}_{P_D}[-\log P_D(X)], $$
where the latter equality follows from (3.6.6) and (3.6.7). These two crucial steps, which replace a log-probability of an actually realized sequence by its expectation, allow us to connect NCD with Shannon entropy, and then with KL-divergence. It can readily be seen that a similar replacement can still be done if a fixed-order arithmetic coder is used (corresponding to, say, k-th order Markov chains). However, the larger k is, the larger the size of the header will be, and thus the more data are needed before the size of the header becomes negligible. With real data, not generated by any finite-order chain, and modern compressors (which are not fixed-order), it is therefore not clear whether an analogue of (3.6.18) still holds. This would be an interesting topic for future research.
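To make the first-order coder T concrete, the following sketch computes its idealized payload codelength by keeping one static code per single-symbol context; the header, which must now describe on the order of |Σ|² statistics, is omitted, and sending the first symbol raw is a simplifying assumption.

    import math
    from collections import Counter, defaultdict

    def first_order_codelength(data: bytes) -> float:
        """Idealized codelength of a first-order static coder T: one static
        arithmetic code per context (the previous symbol). Header omitted."""
        contexts = defaultdict(Counter)
        for prev, cur in zip(data, data[1:]):
            contexts[prev][cur] += 1
        bits = 8.0  # first symbol sent raw (a simplifying assumption)
        for ctx_counts in contexts.values():
            n = sum(ctx_counts.values())
            bits += -sum(c * math.log2(c / n) for c in ctx_counts.values())
        return bits

    # A file that looks uniform to a zero-order model but is perfectly
    # predictable from one symbol of context:
    D = b"010101" * 30000
    print(first_order_codelength(D), len(D) * 1.0)  # far below N*H0 = N bits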
3.7 Conclusion
In this chapter we have introduced the idea of a mathematical distance function and discussed the notion of a similarity metric. We defined the NCD and the related NID function, and discussed some properties of each. A strategy for calculating these functions using real compressors was outlined, and a mathematical connection was made between a particular case of NCD and the familiar statistical quantity called KL-divergence.
Chapter 4 A New Quartet Tree Heuristic For Hierarchical Clustering
This chapter is about the quartet method for hierarchical clustering. We introduce the notion of hierarchical clustering in Section 4.3, and then proceed to explain the quartet method in Section 4.4. We address computational complexity issues in Section 4.5.1. Our line of reasoning leads naturally to a simple but effective non-deterministic algorithm to monotonically approximate a best-fitting solution to a given input quartet cost list. We describe the algorithm in Section 4.6.1, with performance analysis in Section 4.6.2. In the remainder of the chapter we present a series of experiments demonstrating the tree building system.
4.1 Summary