Statistical Inference Through Data Compression
Rudi Cilibrasi

ILLC Dissertation Series DS-2007-01

For further information about ILLC-publications, please contact
Institute for Logic, Language and Computation
Universiteit van Amsterdam
Plantage Muidergracht 24
1018 TV Amsterdam
phone: +31-20-525 6051
fax: +31-20-525 5206
e-mail: [email protected]
homepage: http://www.illc.uva.nl/

ACADEMIC DISSERTATION (ACADEMISCH PROEFSCHRIFT)
to obtain the degree of doctor at the Universiteit van Amsterdam, on the authority of the Rector Magnificus, prof.mr. P.F. van der Heijden, to be defended in public before a committee appointed by the Doctorate Board (college voor promoties), in the Aula of the University on Friday 23 February 2007 at 10.00, by Rudi Langston Cilibrasi, born in Brooklyn, New York, United States.

Doctoral committee (Promotiecommissie):
Promotor: Prof.dr.ir. P.M.B. Vitányi
Co-promotor: Dr. P.D. Grünwald
Other members: Prof.dr. P. Adriaans, Prof.dr. R. Dijkgraaf, Prof.dr. M. Li, Prof.dr. B. Ryabko, Prof.dr. A. Siebes, Dr. L. Torenvliet

Faculty of Natural Sciences, Mathematics and Computer Science (Faculteit der Natuurwetenschappen, Wiskunde en Informatica)

Copyright © 2007 by Rudi Cilibrasi
Printed and bound by PRINTPARTNERS IPSKAMP.
ISBN: 90–6196–540–3

My arguments will be open to all, and may be judged of by all.
– Publius

Contents

1 Introduction
  1.1 Overview of this thesis
    1.1.1 Data Compression as Learning
    1.1.2 Visualization
    1.1.3 Learning From the Web
    1.1.4 Clustering and Classification
  1.2 Gestalt Historical Context
  1.3 Contents of this Thesis

2 Technical Introduction
  2.1 Finite and Infinite
  2.2 Strings and Languages
  2.3 The Many Facets of Strings
  2.4 Prefix Codes
    2.4.1 Prefix Codes and the Kraft Inequality
    2.4.2 Uniquely Decodable Codes
    2.4.3 Probability Distributions and Complete Prefix Codes
  2.5 Turing Machines
  2.6 Kolmogorov Complexity
    2.6.1 Conditional Kolmogorov Complexity
    2.6.2 Kolmogorov Randomness and Compressibility
    2.6.3 Universality In K
    2.6.4 Sophisticated Forms of K
  2.7 Classical Probability Compared to K
  2.8 Uncomputability of Kolmogorov Complexity
  2.9 Summary

3 Normalized Compression Distance (NCD)
  3.1 Similarity Metric
  3.2 Normal Compressor
  3.3 Background in Kolmogorov complexity
  3.4 Compression Distance
  3.5 Normalized Compression Distance
  3.6 Kullback-Leibler divergence and NCD
    3.6.1 Static Encoders and Entropy
    3.6.2 NCD and KL-divergence
  3.7 Conclusion

4 A New Quartet Tree Heuristic For Hierarchical Clustering
  4.1 Summary
  4.2 Introduction
  4.3 Hierarchical Clustering
  4.4 The Quartet Method
  4.5 Minimum Quartet Tree Cost
    4.5.1 Computational Hardness
  4.6 New Heuristic
    4.6.1 Algorithm
    4.6.2 Performance
    4.6.3 Termination Condition
    4.6.4 Tree Building Statistics
    4.6.5 Controlled Experiments
  4.7 Quartet Topology Costs Based On Distance Matrix
    4.7.1 Distance Measure Used
    4.7.2 CompLearn Toolkit
    4.7.3 Testing The Quartet-Based Tree Construction
  4.8 Testing On Artificial Data
  4.9 Testing On Heterogeneous Natural Data
  4.10 Testing on Natural Data
    4.10.1 Analyzing the SARS and H5N1 Virus Genomes
    4.10.2 Music
    4.10.3 Mammalian Evolution
  4.11 Hierarchical versus Flat Clustering

5 Classification systems using NCD
  5.1 Basic Classification
    5.1.1 Binary and Multiclass Classifiers
    5.1.2 Naive NCD Classification
  5.2 NCD With Trainable Classifiers
    5.2.1 Choosing Anchors
  5.3 Trainable Learners of Note
    5.3.1 Neural Networks
    5.3.2 Support Vector Machines
    5.3.3 SVM Theory
    5.3.4 SVM Parameter Setting

6 Experiments with NCD
  6.1 Similarity
  6.2 Experimental Validation
  6.3 Truly Feature-Free: The Case of Heterogeneous Data
  6.4 Music Categorization
    6.4.1 Details of Our Implementation
    6.4.2 Genres: Rock vs. Jazz vs. Classical
    6.4.3 Classical Piano Music (Small Set)
    6.4.4 Classical Piano Music (Medium Set)
    6.4.5 Classical Piano Music (Large Set)
    6.4.6 Clustering Symphonies
    6.4.7 Future Music Work and Conclusions
    6.4.8 Details of the Music Pieces Used
  6.5 Genomics and Phylogeny
    6.5.1 Mammalian Evolution:
    6.5.2 SARS Virus:
    6.5.3 Analysis of Mitochondrial Genomes of Fungi:
  6.6 Language Trees
  6.7 Literature
  6.8 Optical Character Recognition
  6.9 Astronomy
  6.10 Conclusion

7 Automatic Meaning Discovery Using Google
  7.1 Introduction
    7.1.1 Googling for Knowledge
    7.1.2 Related Work and Background NGD
    7.1.3 Outline
  7.2 Extraction of Semantic Relations with Google
    7.2.1 Genesis of the Approach
  7.3 Theory of Googling for Similarity
    7.3.1 The Google Distribution:
    7.3.2 Google Semantics:
    7.3.3 The Google Code:
    7.3.4 The Google Similarity Distance:
    7.3.5 Universality of Google Distribution:
    7.3.6 Universality of Normalized Google Distance:
  7.4 Introduction to Experiments
    7.4.1 Google Frequencies and Meaning
    7.4.2 Some Implementation Details
    7.4.3 Three Applications of the Google Method
  7.5 Hierarchical Clustering
    7.5.1 Colors and Numbers
    7.5.2 Dutch 17th Century Painters
    7.5.3 Chinese Names
  7.6 SVM Learning
    7.6.1 Emergencies
    7.6.2 Learning Prime Numbers
    7.6.3 WordNet Semantics: Specific Examples
    7.6.4 WordNet Semantics: Statistics
  7.7 Matching the Meaning
  7.8 Conclusion

8 Stemmatology
  8.1 Introduction
  8.2 A Minimum-Information Criterion
  8.3 An Algorithm for Constructing Stemmata
  8.4 Results and Discussion
  8.5 Conclusions

9 Comparison of CompLearn with PHYLIP

10 CompLearn Documentation

Bibliography

Index

11 Dutch Summary (Nederlands Samenvatting)

12 Biography

List of Figures

1.1 The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Figure 4.14 on page 70 where it was used to illustrate a point of hierarchical clustering versus flat clustering. We have redrawn the tree from our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.996.
1.2 Several people’s names, political parties, regions, and other Chinese names.