Applications of Information Theory to Machine Learning Jeremy Bensadon
To cite this version:
Jeremy Bensadon. Applications of Information Theory to Machine Learning. Metric Geometry [math.MG]. Université Paris Saclay (COmUE), 2016. English. NNT : 2016SACLS025. tel-01297163
HAL Id: tel-01297163 https://tel.archives-ouvertes.fr/tel-01297163 Submitted on 3 Apr 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Doctoral thesis
of Université Paris-Saclay, prepared at Université Paris-Sud
Doctoral School No. 580: Information and communication sciences and technologies
Speciality: Mathematics and computer science
by Jérémy Bensadon
Applications of information theory to statistical learning
Thesis presented and defended at Orsay, on 2 February 2016
Composition of the jury:
M. Sylvain Arlot, Professor, Université Paris-Sud, President of the jury
M. Aurélien Garivier, Professor, Université Paul Sabatier, Reviewer
M. Tobias Glasmachers, Junior Professor, Ruhr-Universität Bochum, Reviewer
M. Yann Ollivier, Chargé de recherche, Université Paris-Sud, Thesis advisor

Acknowledgements
This thesis is the result of three years of work, during which I met many people who helped me produce this document, whether through their advice and ideas or simply through their presence.
First of all, I would like to thank my thesis advisor Yann Ollivier. Our many discussions brought me a great deal, and I always came out of them with clearer ideas than I went in. I also thank my reviewers Aurélien Garivier and Tobias Glasmachers for their careful reading of my thesis and their comments, as well as Sylvain Arlot, who kept me from getting lost in the regression literature and accepted to be part of my jury. Frédéric Barbaresco showed a keen interest in the third part of this thesis, and provided numerous references, some of them hard to find. The TAO team was an enriching and pleasant work environment, thanks to the diversity and good spirits of its members, in particular my fellow PhD students. Special thanks to the secretaries of the team and of the doctoral school, Marie-Carol, then Olga, and Stéphanie, for their efficiency. Their help with the various re-registration and defense procedures was invaluable.
Finally, I would especially like to thank my family and friends, in particular my parents and my sisters. For everything.
Paris, 27 January 2016. Jérémy Bensadon
Contents
Introduction
  1 Regression
  2 Black-Box optimization: Gradient descents
  3 Contributions
  4 Thesis outline
Notation
I Information theoretic preliminaries
1 Kolmogorov complexity
  1.1 Motivation for Kolmogorov complexity
  1.2 Formal Definition
  1.3 Kolmogorov complexity is not computable
2 From Kolmogorov Complexity to Machine Learning
  2.1 Prefix-free Complexity, Kraft's Inequality
  2.2 Classical upper bounds for Kolmogorov complexity
    2.2.1 Coding integers
    2.2.2 Generic bounds
  2.3 Probability distributions and coding: Shannon encoding theorem
    2.3.1 Non-integer codelengths do not matter: Arithmetic coding
  2.4 Model selection and Kolmogorov complexity
  2.5 Possible approximations of Kolmogorov complexity
3 Universal probability distributions
  3.1 Two-part codes
    3.1.1 Optimal precision
    3.1.2 The i.i.d. case: confidence intervals and Fisher information
    3.1.3 Link with model selection
  3.2 Bayesian models, Jeffreys' prior
    3.2.1 Motivation
    3.2.2 Construction
    3.2.3 Example: the Krichevsky–Trofimov estimator
  3.3 Context tree weighting
    3.3.1 Markov models and full binary trees
    3.3.2 Prediction for the family of visible Markov models
    3.3.3 Computing the prediction
      3.3.3.1 Bounded depth
      3.3.3.2 Generalization
    3.3.4 Algorithm
II Expert Trees
4 Expert trees: a formal context
  4.1 Experts
    4.1.1 General properties
  4.2 Operations with experts
    4.2.1 Fixed mixtures
    4.2.2 Bayesian combinations
    4.2.3 Switching
      4.2.3.1 Definition
      4.2.3.2 Computing some switch distributions: The forward algorithm
    4.2.4 Restriction
    4.2.5 Union
      4.2.5.1 Domain union
      4.2.5.2 Target union
      4.2.5.3 Properties
  4.3 Expert trees
    4.3.1 Context Tree Weighting
    4.3.2 Context Tree Switching
      4.3.2.1 Properties
    4.3.3 Edgewise context tree algorithms
      4.3.3.1 Edgewise Context Tree Weighting as a Bayesian combination
      4.3.3.2 General properties of ECTS
    4.3.4 Practical use
      4.3.4.1 Infinite depth algorithms
        4.3.4.1.1 Properties
      4.3.4.2 Density estimation
      4.3.4.3 Text compression
      4.3.4.4 Regression
5 Comparing CTS and CTW for regression
  5.1 Local experts
    5.1.1 The fixed domain condition
    5.1.2 Blind experts
    5.1.3 Gaussian experts
    5.1.4 Normal-Gamma experts
  5.2 Regularization in expert trees
    5.2.1 Choosing the regularization
  5.3 Regret bounds in the noiseless case
    5.3.1 CTS
    5.3.2 ECTS
    5.3.3 CTW
6 Numerical experiments
  6.1 Regression
  6.2 Text Compression
    6.2.1 CTS in [VNHB11]
III Geodesic Information Geometric Optimization
7 The IGO framework
  7.1 Invariance under Reparametrization of θ: Fisher Metric
  7.2 IGO Flow, IGO Algorithm
  7.3 Geodesic IGO
  7.4 Comparable pre-existing algorithms
    7.4.1 xNES
    7.4.2 Pure Rank-µ CMA-ES
8 Using Noether's theorem to compute geodesics
  8.1 Riemannian Geometry, Noether's Theorem
  8.2 GIGO in $\tilde{G}_d$
    8.2.1 Preliminaries: Poincaré Half-Plane, Hyperbolic Space
    8.2.2 Computing the GIGO Update in $\tilde{G}_d$
  8.3 GIGO in $G_d$
    8.3.1 Obtaining a First Order Differential Equation for the Geodesics of $G_d$
    8.3.2 Explicit Form of the Geodesics of $G_d$ (from [CO91])
  8.4 Using a Square Root of the Covariance Matrix
9 Blockwise GIGO, twisted GIGO
  9.1 Decoupling the step size
    9.1.1 Twisting the Metric
    9.1.2 Blockwise GIGO, an almost intrinsic description of xNES
  9.2 Trajectories of Different IGO Steps
10 Numerical experiments
  10.1 Benchmarking
    10.1.1 Failed Runs
    10.1.2 Discussion
  10.2 Plotting Trajectories in $G_1$
Conclusion
  1 Summary
    1.1 Expert Trees
    1.2 GIGO
  2 Future directions
A Expert Trees
  A.1 Balanced sequences
    A.1.1 Specific sequence achieving the bound in Section 5.3.1
  A.2 Loss of Normal–Gamma experts
  A.3 Pseudocode for CTW
B Geodesic IGO
  B.1 Generalization of the Twisted Fisher Metric
  B.2 Twisted Geodesics
  B.3 Pseudocodes
    B.3.1 For All Algorithms
    B.3.2 Updates
Bibliography
Index
Introduction
Information theory has a wide range of applications. In this thesis, we focus on two different machine learning problems, for which information-theoretic insight proved useful. The first one is regression of a function f : X → Y. We are given a sequence (x_i) and we want to predict f(x_i) knowing f(x_j) for j < i. We use techniques inspired by the Minimum Description Length principle to obtain a quick and robust algorithm for online regression. The second one is black-box optimization. Black-box optimization consists in finding the minimum of a function f when the only information we have about f is a "black box" returning the value of f at any queried point. Our goal is to find the minimum of f using as few calls to the black box as possible. We transform this optimization problem into an optimization problem over a family Θ of probability distributions and, by exploiting its Riemannian manifold structure [AN07], we introduce a black-box optimization algorithm that does not depend on arbitrary user choices.
1 Regression
Density estimation and text prediction can be seen as general regression problems. Indeed, density estimation on X is regression of a random function from a singleton to X, and text prediction with alphabet A is regression of a function from ℕ to A, with sample points coming in order. This remark led us to generalize a well-known text compression algorithm, Context Tree Weighting (CTW), to general regression problems, and therefore also to density estimation. The CTW algorithm computes the Bayesian mixture of all visible Markov models for prediction. The generalized CTW algorithm computes a similar Bayesian mixture, but on regression models using specialized submodels on partitions of the domain of f. The idea of using trees for regression is not new. We give the essential characteristics of the general CTW algorithm below:
• The main motivation for the generalized CTW algorithm is the minimum description length principle: we want the shortest description of the data. More precisely, the prediction of f(x_i) is a probability distribution P on Y, and the loss is −ln P(f(x_i)) (this point of view will be developed in Part I). MDL-driven histogram models for density estimation have been studied in [KM07].

[Figure 1: A Markov model for text prediction on {c, d} (Figure 3.1). Each leaf s contains a local model for characters occurring after the context s.]

[Figure 2: A Markov model for regression on [0, 1). Each leaf [a, b) contains a local model for regression on [a, b).]
In several other methods where Y = ℝ, the prediction for f(x_i) is a real number, and the loss incurred is the square of the error; this corresponds to the log loss where the prediction is a Gaussian probability distribution around f(x_i) with fixed variance.
• CTW is robust: it computes the Bayesian mixture of a very large number of models, and consequently performs at least as well as the best of these models (up to a constant independent of the data). Moreover, since CTW performs model averaging and not model selection (as in [Aka09]), a small variation in the previous data cannot lead to a large variation in the predictions (unless the local models are not continuous).
• The tradeoff for this robustness is that to compute the average over all the models, the tree has to be fixed in advance (for example, by splitting the interval corresponding to a given node in two equal parts, we obtain a Bayesian mixture on all dyadic partition models). This is different from CART or random forests [BFOS84, HTF01, RG11], since the trees built by these algorithms are allowed to depend on the data.
• Contrary to kernel methods for density estimation, or lasso and ridge regression for regression [HTF01], CTW is an online model, and consequently it has to run fast: the running time for one data point is the depth of the tree multiplied by the time complexity of the models at the nodes.
• CTW is not a model by itself; it is a way of combining models. It is close to Hierarchical Mixtures of Experts [HTF01], but the main difference with the latter is that CTW treats all nodes the same way.
2 Black-Box optimization: Gradient descents
A common approach for black-box optimization, shared for example by CMA-ES [Han11] and xNES [GSY+10], is to maintain a probability distribution P^t according to which new points are sampled, and to update P^t according to the points sampled at time t. A simple way of updating P^t is a gradient descent on some space of probability distributions (we will be specifically interested in Gaussian distributions), parametrized by some Θ (e.g. mean and covariance matrix for Gaussians). We try to minimize $E_{P_t}(f)$, thus yielding the update:
$$\theta^{t+\delta t} = \theta^t - \delta t\,\frac{\partial E_{P_\theta}(f)}{\partial\theta}\bigg|_{\theta=\theta^t}. \qquad (1)$$
However, there is a fundamental problem: $\frac{\partial E_{P_\theta}(f)}{\partial\theta}$ depends on the parametrization θ ↦ P_θ. As a simple example, suppose we are trying to minimize f : (x, y) ↦ x² + y². The gradient update is (x, y)^{t+δt} = (x, y)^t − δt (2x, 2y). Now, we change the parametrization by setting x′ = x/1000. We have f(x′, y) = 10⁶(x′)² + y², so the new gradient update reads (x′, y)^{t+δt} = (x′, y)^t − δt (2·10⁶ x′, 2y), which corresponds to (x, y)^{t+δt} = (x, y)^t − δt (2·10⁶ x, 2y), a different update. In general, for any given time step δt, by simply changing the parametrization, it is possible to reach any point of the half-space of directions δθ satisfying (∂f/∂θ)·δθ < 0, so gradient descents are not well defined.
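A quick numerical sketch of this parametrization dependence (the step size and starting point are arbitrary illustrative choices):

```python
# One vanilla gradient step on f(x, y) = x^2 + y^2, in the original
# parametrization and in the rescaled one x' = x / 1000, mapped back
# to (x, y) coordinates. Same step size, very different updates.

def step_original(x, y, dt):
    return x - dt * 2 * x, y - dt * 2 * y       # gradient of x^2 + y^2

def step_rescaled(x, y, dt):
    xp = x / 1000.0                              # x' = x / 1000
    xp_new = xp - dt * 2e6 * xp                  # gradient of 10^6 (x')^2 + y^2
    return 1000.0 * xp_new, y - dt * 2 * y       # back to (x, y) coordinates

print(step_original(1.0, 1.0, 0.01))   # (0.98, 0.98)
print(step_rescaled(1.0, 1.0, 0.01))   # (-19999.0, 0.98)
```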
A possible way of solving this problem is to endow Θ with a metric θ ↦ I(θ); the gradient update then reads
$$\theta^{t+\delta t} = \theta^t - \delta t\, I^{-1}(\theta)\,\frac{\partial E_{P_\theta}(f)}{\partial\theta}.$$
This point of view will be developed at the beginning of Part III. The Fisher metric is a metric that appears naturally when considering probability distributions. The gradient associated with the Fisher metric is called the natural gradient, and the study of spaces of probability distributions endowed with this metric (so that they are Riemannian manifolds) is called Information Geometry [AN07]. As shown in [OAAH11], both CMA-ES and xNES can be described as natural gradient descents, for different parametrizations. Following [OAAH11], instead of applying a pure gradient method, we used this Riemannian manifold structure to define a fully parametrization-invariant algorithm that follows geodesics.
3 Contributions
The contributions of this thesis can be separated into two parts: the first one concerns the generalization of the Context Tree Weighting algorithm, and the second one concerns black-box optimization. They are the following:
Generalization of the Context Tree Weighting algorithm. The Context Tree Weighting algorithm efficiently computes a Bayesian combination of all visible Markov models for text compression. We generalize this idea to regression and density estimation, replacing the set of visible Markov models by (for example) the set of dyadic partitions of [0, 1], which correspond to the same tree structure. The framework developed here also recovers several algorithms derived from the original CTW algorithm, such as Context Tree Switching [VNHB11], Adaptive Context Tree Weighting [OHSS12], or Partition Tree Weighting [VWBG12].

Edgewise context tree algorithms. Suppose that we are regressing a function on [0, 1) but we have mostly seen points in [0, 0.5). If the predictions on [0, 0.5) are good, the posterior of the model "split [0, 1) into [0, 0.5) and [0.5, 1)" will be much greater than the posterior of the model "use the local expert on [0, 1)". In particular, a new point falling in [0.5, 1) will be predicted with the local expert for [0.5, 1), but this model will not have enough observations to issue a meaningful prediction.
The edgewise versions of the context tree algorithms address this by allowing models of the form "use the local expert on [0, 0.5) if the data falls there, and use the local expert on [0, 1) if the data falls into [0.5, 1)", with the same algorithmic complexity as the non-edgewise versions. The edgewise context tree weighting algorithm can even be described as a Bayesian combination similar to the regular CTW case (Theorem 4.41).

Switch behaves better than Bayes in context trees. We prove that in a one-dimensional regression case, the Context Tree Switching algorithm offers better performance than the Context Tree Weighting algorithm. More precisely, for any Lipschitz regression function:
• The loss of the CTS and edgewise CTS algorithms at time t is less than −t ln t + O(t) [1]. This bound holds for a larger class of sample points with the edgewise version of the algorithm (Corollaries 5.14 and 5.18).
[1] Negative losses are an artifact due to the continuous setting, with densities instead of actual probabilities. "Loss L" has to be read as "−ln ε + L bits are needed to encode the data with precision ε".
• The loss of the CTW algorithm at time t is at least $-t\ln t + \frac{1}{2}t\ln\ln t + O(t)$ (Corollary 5.20) [2]. More generally, under the same assumptions, any Bayesian combination of any unions of experts incurs the same loss (Corollary 5.21).
A fully parametrization-invariant black-box optimization algorithm. As shown in [OAAH11], xNES [GSY+10] and CMA-ES [Han11] can be seen as natural gradient descents in G_d, the space of Gaussian distributions in dimension d endowed with the Fisher metric, using different parametrizations. Although the use of the natural gradient ensures that they are equal at first order, in general, different parametrizations of G_d yield different algorithms. We introduce GIGO, which follows the geodesics of G_d and is consequently fully parametrization-invariant.

Application of Noether's theorem to the geodesics of the space of Gaussians. Noether's theorem roughly states that if a system has symmetries, then there are invariants attached to these symmetries. It can be directly applied to computing the geodesics of G_d. While a closed form of these geodesics has been known for more than twenty years [Eri87, CO91], Noether's theorem provides enough invariants to obtain a system of first-order differential equations satisfied by the geodesics of G_d [3].

A description of xNES as a "Blockwise GIGO" algorithm. We show that, contrary to previous intuition ([DSG+11], Figure 6), xNES does not follow the geodesics of G_d (Proposition 9.13). However, we can give an almost intrinsic description of xNES as "Blockwise GIGO" (Proposition 9.10): if we write $G_d \cong \mathbb{R}^d \times P_d$ and split the gradient accordingly, then the xNES update follows the geodesics of these two spaces separately.
4 Thesis outline
The first part (Chapters 1 to 3) is an introduction focusing on the Minimum Description Length principle, and recalling the Context Tree Weighting [WST95] algorithm, which will be generalized in Part II, and the Fisher metric, which will be used in Part III. This part mostly comes from the notes of the first three talks of a series given by Yann Ollivier; the full notes can be found at www.lri.fr/~bensadon. Chapter 4 introduces a formal way of handling sequential prediction, experts [4], and presents switch distributions. These tools are then used to generalize the Context Tree Weighting (CTW) and the Context Tree Switching (CTS) algorithms.
[2] See footnote [1].
[3] Theorem 8.13 gives equation (4) from [CO91], the integration constants being the Noether invariants.
[4] Experts can be seen as a generalization of Prequential Forecasting Systems [Daw84].

Chapter 5 compares the performance of the Context Tree Weighting and the Context Tree Switching algorithms on a simple regression problem. In Chapter 6, numerical experiments are conducted. The next part led to [Ben15]. Chapter 7 presents the IGO framework [OAAH11], which can be used to describe several state-of-the-art black-box optimization algorithms such as xNES [GSY+10] and CMA-ES [Han11] [5], and introduces Geodesic IGO, a fully parametrization-invariant algorithm. The geodesics of G_d, the space of Gaussians in dimension d, needed for the practical implementation of the GIGO algorithm, are computed in Chapter 8. Chapter 9 introduces two possible modifications of GIGO (twisted GIGO and blockwise GIGO) to allow it to incorporate different learning rates for the mean and the covariance. Finally, GIGO is compared with pure rank-µ CMA-ES and xNES in Chapter 10.
[5] Actually, only pure rank-µ CMA-ES can be described in the IGO framework.
Notation
We introduce here some general notation that we will use throughout this document.
• |x| is the “size” of x. Its exact signification depends on the type of x (length of a string, cardinal of a set, norm of a vector... ).
• If A is a set, A* denotes the set of finite words on A.

• $\prod_{i=n}^{m}\cdots$ and $\sum_{i=n}^{m}\cdots$ when m < n denote respectively an empty product, which is 1, and an empty sum, which is 0.
• We usually denote the nodes of a tree T by character strings. In the binary case, for example, the root will be ε, and the children of s are obtained by appending 0 or 1 to s. We will use prefix or suffix notation according to the context. We simply refer to the set of the children of a given node whenever possible.
• If T is a tree, we denote by Ts the maximal subtree of T with root s.
• Vectors (and strings) will be denoted either with bold letters (e.g. z) if we are not interested in their length, or with a superscript index (e.g. $z^n = (z_1, z_2, ..., z_n)$). $z^{n:m} := (z_n, z_{n+1}, ..., z_{m-1}, z_m)$ denotes the substring of z from index n to index m. If no $z_0$ is defined, $z^0 := \emptyset$.

• For any set X, for any x ∈ X, for any k ∈ ℕ ∪ {∞}, $x^k$ is the vector (or string) of length k containing only x.
• For any set X, for $I, J \in X^{\mathbb{N}}$, we write $I \sim_n J$ if $I_i = J_i$ for i > n.
Part I
Information theoretic preliminaries
Chapter 1
Kolmogorov complexity
Most of the ideas discussed here can be found in [LV08].
1.1 Motivation for Kolmogorov complexity
When faced with the sequence 2 4 6 8 10 12 x, anybody would expect x to be 14. However, one could argue that the sequence 2 4 6 8 10 12 14 does not exist "more" than 2 4 6 8 10 12 13, or 2 4 6 8 10 12 0: there seems to be no reason for picking 14 instead of anything else. There is an answer to this argument: 2 4 6 8 10 12 14 is "simpler" than other sequences, in the sense that it has a shorter description. This can be seen as a variant of Occam's razor, which states that for a given phenomenon, the simplest explanation should be preferred. This principle was formalized in the 1960s by Solomonoff [Sol64] and Kolmogorov [Kol65]. As we will soon see, the Kolmogorov complexity of an object is essentially the length of its shortest description, where "description of x" means "algorithm that can generate x", measured in bits. However, in practice, finding the shortest description of an object is difficult.
Kolmogorov complexity provides a reasonable justification for "inductive reasoning", which corresponds to trying to find short descriptions for sequences of observations. The general idea is that any regularity, or structure, detected in the data can be used to compress it. This criterion can also be used for prediction: given a sequence x₁, ..., xₙ, choose the x_{n+1} such that the sequence x₁, ..., x_{n+1} has the shortest description, or in other words, such that x_{n+1} "compresses best" with the previous x_i. For example, given the sequence 0000000?, 0 should be predicted, because 00000000 is simpler than 0000000x for any other x. As a more sophisticated example, given a sequence x₁, y₁, x₂, y₂, x₃, y₃, x₄, ..., if we find a simple f such that f(x_i) = y_i, we should predict f(x₄) as the next element of the sequence. With this kind of relationship, it is only necessary to know the x_i and f to be able to write the full sequence. If we also have x_{i+1} = g(x_i), then only x₀, f and g have to be known: somebody who has understood how the sequence (1, 1; 2, 4; 3, 9; 4, 16; ...) is made will be able to describe it very efficiently.
Any better understanding of the data can therefore be used to find structure in the data, and consequently to compress it better: comprehension and compression are essentially the same thing. This is the Minimum Description Length (MDL) principle [Ris78]. In the sequence (1, 1; 2, 4; 3, 9; 4, 16; ...), f was very simple, but the more data we have, the more complicated f can reasonably be: if you have to learn by heart two sentences x₁, y₁, where x₁ is in English and y₁ is x₁ in German, and you do not know German, you should just learn y₁. If you have to learn by heart a very long sequence of sentences such that x_i is a sentence in English and y_i is its translation in German, then you should learn German. In other words, the more data we have, the more interesting it is to find regularity in it.
The identity between comprehension and compression is probably even clearer when we consider text encoding: with a naive encoding, a simple text in English takes around 5 bits for each character (26 letters, space, dot), whereas the best compression algorithms manage around 3 bits per character. However, by removing random letters from a text and having people try to read it, the actual information has been estimated at around 1 bit per character ([Sha51] estimates the entropy between 0.6 and 1.3 bits per character).
1.2 Formal Definition
Let us now define the Kolmogorov complexity formally:

Definition 1.1. The Kolmogorov complexity of x, a sequence of 0s and 1s, is by definition the length of the shortest program on a universal Turing machine [1] that prints x. It is measured in bits [2].

The two propositions below must be seen as a sanity check for our definition of Kolmogorov complexity. Their proofs can be found in [LV08].

Proposition 1.2 (Kolmogorov complexity is well-defined). The Kolmogorov complexity of x does not depend on the Turing machine, up to a constant which does not depend on x (but does depend on the two Turing machines).
Sketch of the proof. If P₁ prints x for the Turing machine T₁, then, if I₁₂ is an interpreter for language 1 in language 2, I₁₂ :: P₁ prints x for the Turing machine T₂, and therefore K₂(x) ≤ K₁(x) + length(I₁₂).

[1] Your favorite programming language, for example.
[2] Or any multiple of bits.

In other words, if P is the shortest zipped program that prints x in your favourite programming language, you can think about the Kolmogorov complexity of x as the size (as a file on your computer) of P (we compress the program to reduce differences of alphabet and syntax between programming languages). Since Kolmogorov complexity is defined up to an additive constant, all inequalities concerning Kolmogorov complexity are only true up to an additive constant. Consequently, we write $K(x) \leq^+ f(x)$ for K(x) ≤ f(x) + a, where a does not depend on x:
Notation 1.3. Let X be a set, and let f, g be two applications from X to ℝ. We write $f \leq^+ g$ if there exists a ∈ ℝ such that for all x ∈ X, f(x) ≤ g(x) + a.
If the objects we are interested in are not sequences of 0 and 1 (pictures, for example), they have to be encoded.
Proposition 1.4. The Kolmogorov complexity of x does not depend on the encoding of x, up to a constant which does not depend on x (but does depend on the two encodings).
Sketch of the proof. Let f, g be two encodings of x. We have K(g(x)) ≤ K(f(x)) + K(g ∘ f⁻¹) (instead of encoding g(x), encode f(x) and the map g ∘ f⁻¹; max(K(g ∘ f⁻¹), K(f ∘ g⁻¹)) is the constant).
In other words, the Kolmogorov complexity of a picture will not change if you decide to put the most significant bytes for your pixels on the right instead of the left. Notice that the constants for these two propositions are usually reasonably small (the order of magnitude should not exceed the megabyte, while it is possible to work with gigabytes of data).
1.3 Kolmogorov complexity is not computable
Kolmogorov complexity is not computable. Even worse, it is never possible to prove that the Kolmogorov complexity of an object is large. An intuitive reason for the former fact is that to find the Kolmogorov complexity of x, we should run all possible programs in parallel and choose the shortest program that outputs x, but we do not know when we have to stop: there may be a short program still running that will eventually output x. In other words, it is possible to have a program of length K(x) that outputs x, but it is not possible to be sure that it is the shortest one.

Theorem 1.5 (Chaitin's incompleteness theorem). There exists a constant L [3] such that it is not possible to prove the statement K(x) > L for any x.
Sketch of the proof. For some L, write a program that tries to prove a statement of the form K(x) > L (by enumerating all possible proofs). When a proof of K(x) > L for some x is found, print x and stop. If there exists an x such that a proof of K(x) > L exists, then the program will stop and print some x₀; but if L has been chosen large enough, the length of the program is less than L, and it describes x₀. Therefore K(x₀) ≤ L: contradiction.
This theorem is proved in [Cha71] and can be linked to Berry's paradox: "The smallest number that cannot be described in less than 13 words" becomes "The [first x found] that cannot be described in less than [L] bits".
Corollary 1.6. Kolmogorov complexity is not computable.
Proof. Between 1 and $2^{L+1}$, there must be at least one integer n₀ with Kolmogorov complexity greater than L (since there are only $2^{L+1} - 1$ programs of length L or less). If there were a program that could output the Kolmogorov complexity of its input, we could prove that K(n₀) > L, which contradicts Chaitin's theorem.
As a possible solution to this problem, we could define $K_t(x)$, the length of the smallest program that outputs x in less than t instructions, but we lose some theoretical properties of Kolmogorov complexity (Propositions 1.2 and 1.4 have to be adapted to limited time, but they are weaker; see [LV08], Section 7. For example, Proposition 1.2 becomes $K_{ct\log_2 t,\,1}(x) \leq K_{t,\,2}(x) + c$).
[3] Reasonably small, around 1 MB.
Chapter 2
From Kolmogorov Complexity to Machine Learning
Because of Chaitin's theorem, the best we can do is to find upper bounds on Kolmogorov complexity. First, we introduce prefix-free Kolmogorov complexity, prove Kraft's inequality, and state Shannon's encoding theorem. Finally, we give classical upper bounds on Kolmogorov complexity, first for integers, then for other strings.
2.1 Prefix-free Complexity, Kraft’s Inequality
If we simply decide to encode an integer by its binary decomposition, then we do not know, for example, if the string "10" is the code for 2, or the code for 1 followed by the code for 0. Similarly, given a Turing machine and two programs P₁ and P₂, there is nothing that prevents the concatenation P₁P₂ from defining another program that might have nothing to do with P₁ or P₂. This leads us to the following (not formal) definition:
Definition 2.1. A set of strings S is said to be prefix-free if no element of S is a prefix of another. A code is said to be prefix-free if no codeword is a prefix of another (or equivalently, if the set of codewords is prefix-free).
We then adapt the definition of Kolmogorov complexity by forcing the set of programs to be prefix-free (hence the name prefix-free Kolmogorov complexity [1]).
[1] "self-delimited" is sometimes used instead of prefix-free.

It is clear now that if we receive a message encoded with a prefix-free code, we can decode it unambiguously letter by letter while we are reading it, and if a Turing machine is given two programs P₁ and P₂, their concatenation will not be ambiguous. With programming languages, for example, working with prefix-free complexity does not change anything, since the set of compiling programs is prefix-free (the compiler is able to stop when the program is finished). However, being prefix-free is a strong constraint: a short codeword forbids many longer words. More accurately, we have the following result:
Proposition 2.2 (Kraft's inequality). Let $\mathcal{S}$ be the set of prefix-free sets of strings over {0, 1}, and let $S \in \mathcal{S}$. We have
$$\sum_{s\in S} 2^{-|s|} \leq 1. \qquad (2.1)$$

Sketch of the proof. We construct a map φ from $\mathcal{S}$ to the set of binary trees: for $S \in \mathcal{S}$, φ(S) is the smallest full binary tree T such that all elements of S are leaves of T. Such a tree always exists: start with a deep enough tree [2], and close nodes until each pair of "sibling leaves" contains either an element of S or a node that has at least one element of S among its children. No internal node can correspond to an element of S, because it would then be the prefix of another element of S. So finally, we have:
$$\sum_{s\in S} 2^{-|s|} \leq \sum_{s\ \mathrm{leaf\ of}\ \varphi(S)} 2^{-|s|} = 1. \qquad (2.2)$$

[2] We overlook the "infinite S" case, for example S = {0, 10, 110, 1110, ...}, but the idea behind a more rigorous proof would be the same.
For the general proof, see Theorem 1.11.1 in [LV08].
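As a quick sanity check, here is a minimal Python sketch (not from the thesis; the code set is a made-up example) verifying both prefix-freeness and Kraft's inequality (2.1) for a small code:

```python
# Verify prefix-freeness and Kraft's inequality for a small binary code.
codewords = ["1", "01", "001", "000"]   # the prefix-free code of Section 2.3.1

def is_prefix_free(words):
    # No codeword may be a (proper) prefix of another one.
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2 ** -len(w) for w in codewords)
print(is_prefix_free(codewords))  # True
print(kraft_sum)                  # 1.0, which is <= 1 as (2.1) requires
```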
2.2 Classical upper bounds for Kolmogorov complexity
2.2.1 Coding integers

Now, let us go back to integers:
Proposition 2.3. Let n be an integer. We have
$$K(n) \leq^+ \log_2 n + 2\log_2\log_2 n. \qquad (2.3)$$

Proof, and a bit further. Let n be an integer, and let us denote by b its binary expansion, and by l the length of its binary expansion (i.e. $l = \lfloor\log_2(n+1)\rfloor \sim \log_2 n$). Consider the following prefix codes (if c is a code, we will denote by c(n) the codeword used for n):
• c1: Encode n as a sequence of n ones, followed by zero. Complexity n.
• c₂: Encode n as c₁(l) :: b, with l and b as above. To decode, simply count the number k of ones before the first zero. The k bits following it are the binary expansion of n. Complexity 2 log₂ n.

• c₃: Encode n as c₂(l) :: b. To decode, count the number k₁ of ones before the first zero; the k₁ bits following it are the binary expansion of l, and the l bits after these are the binary expansion of n. Complexity log₂ n + 2 log₂ log₂ n, which is what we wanted.
• We can define a prefix-free code c_i by setting c_i(n) = c_{i-1}(l) :: b. The complexity improves, but we get a constant term for small integers, and anyway, all the corresponding bounds are still equivalent to log₂ n as n → ∞.

• It is easy to see that the codes above satisfy |c_n(1)| = n + 1: for small integers, it would be better to stop the encoding once a number of length one (i.e. one or zero) is written. Formally, this can be done the following way: consider $c_{\infty,1}$ defined recursively as follows:
$$c_{\infty,1}(n) = c_{\infty,1}(l-1) :: b :: 0, \qquad (2.4)$$
$$c_{\infty,1}(1) = 0. \qquad (2.5)$$

It is easy to see that $c_{\infty,1}(n)$ begins with 0 for all n, i.e., $c_{\infty,1} = 0 :: c_{\infty,2}$. We can now set $c_\infty(0) = 0$ and $c_\infty(n) = c_{\infty,2}(n+1)$.
The codes c₂ and c₃ are similar to Elias gamma and delta coding, respectively. c∞ is called Elias omega coding [Eli75].
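The following Python sketch implements the encoders for c₁, c₂ and c₃ as described above (decoding follows the recipes in the bullets). It is illustrative code, not from the thesis, and the printed lengths only approximately match the stated complexities:

```python
# Encoders for the prefix codes c1, c2, c3 described above.

def c1(n):                      # n ones followed by a zero
    return "1" * n + "0"

def c2(n):                      # c1(len(b)) :: b, with b the binary expansion of n
    b = bin(n)[2:]
    return c1(len(b)) + b

def c3(n):                      # c2(len(b)) :: b
    b = bin(n)[2:]
    return c2(len(b)) + b

for n in (1, 17, 1000):
    print(n, c2(n), c3(n), len(c2(n)), len(c3(n)))
```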
2.2.2 Generic bounds

We also have the following bounds:
1. A simple program that prints x is simply print(x). The length of this program is the length of the function print, plus the length of x, but it is also necessary to provide the length of x to know when the print function stops reading its input (because we are working with prefix-free Kolmogorov complexity). Consequently
$$K(x) \leq^+ |x| + K(|x|). \qquad (2.6)$$

By counting arguments, some strings x have a Kolmogorov complexity larger than |x|. These strings are called random strings by Kolmogorov. The justification for this terminology is that a string that cannot be compressed is a string with no regularities.

2. The Kolmogorov complexity of x is the length of x compressed with the best compressor for x. Consequently, we can try to approximate it with any standard compressor, like zip, and we have:
$$K(x) \leq^+ |\mathrm{zip}(x)| + |\text{unzip program}|. \qquad (2.7)$$
This property has been used to define the following distance between two objects:
$$d(x, y) = \frac{\max(K(x|y), K(y|x))}{\max(K(x), K(y))}.$$
By using distance-based clustering algorithms, the authors of [CV] have been able to cluster data (MIDI files, texts...) almost as anyone would have expected (the MIDI files were clustered together, with subclusters essentially corresponding to the composer, for example). In the same article, the Universal Declaration of Human Rights in different languages was used to build a satisfying language tree. (A compressor-based approximation of this distance is sketched after this list.)

3. If we have some finite set E such that x ∈ E, then we can simply enumerate all the elements of E. In that case, x can be described as "the n-th element of E". For this, we need K(E) bits to describe E, and ⌈log₂|E|⌉ bits to identify x in E:
$$K(x) \leq^+ K(E) + \lceil\log_2 |E|\rceil. \qquad (2.8)$$

4. More generally,

Theorem 2.4. If µ is a probability distribution on a set X, and x ∈ X, we have
$$K(x) \leq^+ K(\mu) - \log_2(\mu(x)). \qquad (2.9)$$

For example, if µ is uniform, we find equation (2.8). Another simple case is the i.i.d. case, which will be discussed later. This inequality is the most important one for machine learning, because a probability distribution µ such that K(µ) − log₂(µ(x)) is small can be thought of as a "good" description of x, and we can expect it to predict upcoming terms well. The proof of this inequality consists in noting that coding x with −log₂ µ(x) bits satisfies Kraft's inequality.
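As announced in item 2, a standard compressor can stand in for K. The sketch below uses zlib as the compressor and approximates max(K(x|y), K(y|x)) by C(xy) − min(C(x), C(y)), in the spirit of the clustering experiments of [CV]; the data and the exact normalization are illustrative assumptions, not the thesis's construction:

```python
import zlib

def C(s: bytes) -> int:
    # Compressed size as a (crude) stand-in for Kolmogorov complexity.
    return len(zlib.compress(s, 9))

def d(x: bytes, y: bytes) -> float:
    # Normalized compression distance: small for related objects.
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

a = b"the quick brown fox jumps over the lazy dog " * 20
b2 = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4
print(d(a, b2))  # small: the two texts share most of their structure
print(d(a, c))   # closer to 1: unrelated data
```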
2.3 Probability distributions and coding: Shannon encoding theorem
Let X be a countable alphabet, and suppose we are trying to efficiently encode a random sequence of characters $(x_i) \in X^{\mathbb{N}}$, such that $P(x_n|x^{n-1})$ is known:
For all n, we are trying to find the codelength function L minimizing
$$E(L(x^n)) = \sum_{x^n\in X^n} P(x^n)L(x^n), \qquad (2.10)$$
under the constraint given by Kraft's inequality:
$$\sum_{x^n\in X^n} 2^{-L(x^n)} \leq 1. \qquad (2.11)$$
The solution of this optimization problem is $L(x) := -\log_2 P(x)$, and from now on, we will identify codelength functions L with probability distributions $2^{-L}$.

Definition 2.5. Let X be a countable alphabet, let µ and ν be probability distributions on X, and let (x_i) be a sequence of i.i.d. random variables following µ.
The quantity $H(\mu) = -\sum_{x\in X}\mu(x)\log_2\mu(x)$ is called the entropy of µ. It is the expected number of bits needed to encode an x sampled according to µ when using µ.
The quantity $\mathrm{KL}(\mu\|\nu) := \sum_{x\in X}\nu(x)\log_2\frac{\nu(x)}{\mu(x)}$ is called the Kullback–Leibler divergence from ν to µ. It is the expected additional number of bits needed to encode if x is sampled according to ν and we are using µ instead of ν.

The Kullback–Leibler divergence is always nonnegative. In particular, the entropy of a probability distribution µ is the minimal expected number of bits needed to encode an x sampled according to µ.

Proposition 2.6. The Kullback–Leibler divergence is nonnegative: let X be a countable alphabet, and let µ, ν be two probability distributions on X. We have:
$$\mathrm{KL}(\mu\|\nu) \geq 0. \qquad (2.12)$$
Proof.
$$\mathrm{KL}(\mu\|\nu) := \sum_{x\in X}\nu(x)\log_2\frac{\nu(x)}{\mu(x)} \qquad (2.13)$$
$$\geq -\log_2\left(\sum_{x\in X}\nu(x)\frac{\mu(x)}{\nu(x)}\right) = 0, \qquad (2.14)$$
where the inequality is Jensen's inequality applied to the convex function $-\log_2$.
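A small numerical sketch of Definition 2.5 (the distributions are made up for illustration; note the thesis's convention that KL(µ‖ν) measures the extra cost of using µ's code on ν-distributed data):

```python
from math import log2

mu = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
nu = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Entropy of mu: expected codelength when encoding mu-data with mu's code.
H = -sum(p * log2(p) for p in mu.values())
# KL(mu||nu): expected extra bits when encoding nu-data with mu's code.
KL = sum(nu[x] * log2(nu[x] / mu[x]) for x in nu)

print(H)   # 1.75 bits per character
print(KL)  # 0.25 extra bits per character, nonnegative as Proposition 2.6 states
```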
Throughout this section, we have been suggesting to encode any x with −log₂ µ(x) bits for some µ, without checking that −log₂ µ(x) is an integer. We now justify this apparent oversight.

2.3.1 Non-integer codelengths do not matter: Arithmetic coding

The idea behind (2.9) is that for a given set E, if I think some elements are more likely to appear than others, they should be encoded with fewer bits. For example, if in the set {A, B, C, D} we have P(A) = 0.5, P(B) = 0.25, and P(C) = P(D) = 0.125, instead of using a uniform code (two bits for each character), it is better to encode for example [3] A with 1, B with 01, C with 001 and D with 000. In this example, the expected length with the uniform code is 2 bits per character, while it is 1.75 bits per character for the other. In general, it can be checked that the expected length is minimal if the length of the code word for x is −log₂(µ(x)). If we have a code satisfying this property, then (2.9) follows immediately (encode µ, and then use the code corresponding to µ to encode x).
However, if we stick to encoding one symbol after another, approximations have to be made, because we can only have integer codelengths. For example, suppose we have: P(A) = 0.4, P(B) = P(C) = P(D) = P(E) = 0.15. The −log₂ P(·) are not integers: we have to assign close integer codelengths. We describe two possible ways of doing this:
• Sort all symbols by descending frequency, and cut where the cumulative frequency is closest to 0.5. The codes for the symbols on the left (resp. right) start with 0 (resp. 1). Repeat until all symbols have been separated. This is Shannon–Fano coding ([Fan61], similar to the code introduced at the end of the proof of Theorem 9 in [Sha48]).
• Build a binary tree in the following way: start with leaf nodes corresponding to the symbols, with a weight equal to their probability. Then, take the two parentless nodes with lowest weight, create a parent node for them, and assign it the sum of its children's weights. Repeat until only one node remains. Then, code each symbol with the sequence of moves from the root to the leaf (0 corresponds to taking the left child, for example). This is Huffman coding [Huf52], which is better than Shannon–Fano coding.
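Here is a compact Python sketch of the Huffman construction just described, run on the probabilities of the example above (illustrative code, not from the thesis; tie-breaking conventions are arbitrary, so the exact codewords may differ from the table below):

```python
import heapq

def huffman(probs):
    # Heap entries: (weight, tie-breaker, {symbol: codeword-so-far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # lowest-weight parentless node
        p2, _, right = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.4, "B": 0.15, "C": 0.15, "D": 0.15, "E": 0.15}
code = huffman(probs)
print(code)
print(sum(probs[s] * len(code[s]) for s in probs))  # expected length: 2.2 bits
```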
On the example above, we can find the following codes (notice that some conventions are needed to obtain well-defined algorithms from what is described above: for Shannon-Fano, what to do when there are two possible cuts, and for Huffman, which node is the left child and which node is the right child):
[3] We do not care about the code itself; we care about the lengths of the code words for the different elements of X.
Symbol              Theoretical optimal length   Shannon–Fano code   Huffman code
A                   ≈ 1.322                      00                  0
B                   ≈ 2.737                      01                  100
C                   ≈ 2.737                      10                  101
D                   ≈ 2.737                      110                 110
E                   ≈ 2.737                      111                 111
Length expectation  ≈ 2.17                       2.3                 2.2

As we can see, neither Shannon–Fano coding nor Huffman coding reaches the optimal bound. However, if instead of encoding each symbol separately we encode the whole message, (2.9) can actually be achieved up to a constant number of bits for the whole message [4], as we now show by describing a simplification of arithmetic coding [Ris76].
The idea behind arithmetic coding is to encode the whole message as a number in the interval [0, 1], determined as follows. Consider the message (x₁, ..., x_N) ∈ X^N (here, to make things easier, we fix N). We start with the full interval, and we partition it into #X subintervals, each one corresponding to some x ∈ X, of length our expected probability to see x given the characters we have already seen; we then repeat this inside the subinterval corresponding to x₁, and so on until the whole message is read. We are left with a subinterval I_N of [0, 1], and we can send the binary expansion of any number in I_N (so we choose the shortest one).
[Figure 2.1: Arithmetic coding of a word starting with ABB, with P(A) = 0.6, P(B) = 0.2, P(C) = 0.2.]
[4] Since all the inequalities were already up to an additive constant, this does not matter at all.
Algorithm 2.7 (End of the proof of (2.9): a simplification of arithmetic coding). We are given an ordered set X, and for x₁, ..., xₙ, y ∈ X, we denote by µ(y|x₁, ..., xₙ) our expected probability to see y after having observed x₁, ..., xₙ [5]. Goal: encode the whole message as one number 0 ≤ x ≤ 1.

Encoding algorithm, part 1: encode the message as an interval.

    i = 1
    I = [0, 1]
    while x_{i+1} ≠ END do
        Partition I into subintervals (I_x)_{x ∈ X} such that:
            x < y ⟹ I_x < I_y [6]
            length(I_x) = length(I) · µ(x | x₁, ..., x_{i−1})
        Observe x_i
        I = I_{x_i}
        i = i + 1
    end while
    return I

Part 2: pick a number in the interval I. We can find a binary representation for any real number x ∈ [0, 1] by writing $x = \sum_{i=1}^{+\infty} \frac{a_i}{2^i}$, with $a_i \in \{0, 1\}$. Now, for a given interval I, pick the number x_I ∈ I which has the shortest binary representation [7]. The message will be encoded as x_I.

Decoding algorithm: x_I received.

    i = 1
    I = [0, 1]
    while not (termination criterion [8]) do
        Partition I into subintervals (I_x)_{x ∈ X} as in the encoding algorithm.
        x_i ← the only y such that x_I ∈ I_y
        I = I_{x_i}
        i = i + 1
    end while

[5] A simple particular case is the case where µ does not depend on past observations (i.e. µ(y|x₁, ..., xₙ) =: µ(y)) and can therefore be identified with a probability distribution on X.
[6] If I, J are intervals, we write I < J for: ∀x ∈ I, ∀y ∈ J, x < y.
[7] We call length of the representation (a₁, ..., aₙ, ...) the number min{n ∈ ℕ : ∀k > n, a_k = 0}. If length(I) ≠ 0, I necessarily contains a number with a finite representation.

Arithmetic coding allows us to find a code such that the length of the code word for x = (x₁, ..., xₙ) is $\sum_{i=1}^{n} -\log_2(\mu(x_i|x_1, ..., x_{i-1}))$, which is what we wanted. However, arithmetic coding cannot be implemented exactly as described, because of problems of finite precision: if the message is too long, the intervals of the partition might become indistinguishable. It is possible to give an online version of this algorithm which solves this problem: the encoder sends a bit once he knows it (he can send the n-th bit if I contains no multiple of $2^{-n}$); in that case, he does not have to worry about the first n digits of the bounds of the intervals of the partition anymore, which solves the problem of finite precision. In the case where 0.5 remains in the interval, the encoder can remember that if the first bit is 0, then the second is 1, and conversely, thus working in [.25, .75] instead of [0, 1] (and the process can be repeated). Arithmetic coding ensures that we can focus on codelengths instead of actual codes.
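To make the algorithm concrete, here is a toy Python sketch of Algorithm 2.7 using exact rational arithmetic, which sidesteps the finite-precision issue just discussed. It assumes the simple i.i.d. model of footnote [5], with the probabilities of Figure 2.1, and transmits the message length separately (one of the termination criteria of footnote [8]); all names are illustrative:

```python
from fractions import Fraction

MODEL = {"A": Fraction(3, 5), "B": Fraction(1, 5), "C": Fraction(1, 5)}

def interval(message):
    # Part 1: shrink [low, low + width) symbol by symbol.
    low, width = Fraction(0), Fraction(1)
    for ch in message:
        for sym in sorted(MODEL):
            w = width * MODEL[sym]
            if sym == ch:
                width = w
                break
            low += w
    return low, low + width

def shortest_dyadic(low, high):
    # Part 2: the dyadic number in [low, high) with the fewest bits.
    k = 1
    while True:
        num = -((-low.numerator * 2**k) // low.denominator)  # ceil(low * 2^k)
        if Fraction(num, 2**k) < high:
            return format(num, "b").zfill(k)
        k += 1

def decode(bits, n):
    # Mirror the encoder's partitions, n symbols.
    x = Fraction(int(bits, 2), 2 ** len(bits))
    out, low, width = [], Fraction(0), Fraction(1)
    for _ in range(n):
        for sym in sorted(MODEL):
            w = width * MODEL[sym]
            if low <= x < low + w:
                out.append(sym)
                width = w
                break
            low += w
    return "".join(out)

low, high = interval("ABB")
print(low, high)                 # [0.432, 0.456), as in Figure 2.1
code = shortest_dyadic(low, high)
print(code, decode(code, 3))     # a few bits, and "ABB" recovered
```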
2.4 Model selection and Kolmogorov complexity
The term − log2(µ(x)) in Theorem 2.4 is essentially the cost of encoding the data with the help of the model, whereas K(µ) can be seen as a penalty for complicated models (which, in machine learning, prevents overfitting: if the data is “more compressed”, including the cost of the model and the decoder, the description is better). As a basic example, if µ is the Dirac distribution at x, K(µ) = K(x): in
[8] There are several possibilities for the termination criterion:
• We can have an END symbol, to mark the end of the message (in that case, stop when END has been decoded).
• We can give the length N of the message first (in that case, stop when N characters have been decoded).
• The message sent by the encoder can be slightly modified as follows. We decide that a sequence of bits (a₁, ..., aₙ) represents the interval I_a of all numbers whose binary expansion starts with (a₁, ..., aₙ). Now, the message a sent by the encoder is the shortest unambiguous message, in the sense that I_a is contained in I, but is not contained in any of the subintervals I_x corresponding to I (where I is the interval obtained at the end of the first part of the encoding algorithm).
that case, all the complexity is in the model, and the encoding of the data is completely free. More interestingly, if we are trying to fit the data x₁, ..., xₙ to an i.i.d. Gaussian model (which corresponds to the description: "this is Gaussian noise"), with mean m and fixed variance σ², the term −log₂(µ(x)) is equal to $\sum_i (x_i - m)^2$ up to additive and multiplicative constants, and the m we should select is the solution to this least squares problem, which happens to be the sample mean (if we neglect K(m), which corresponds to finding the maximum likelihood estimate). In general, K(µ) is difficult to evaluate. There exist two classical approximations, yielding different results:
• K(µ) can be approximated by the number of parameters in µ. This gives the Akaike Information Criterion (AIC) [Aka73].
• K(µ) can also be approximated by half the number of parameters in µ multiplied by the logarithm of the number of observations. This gives the Bayesian Information Criterion (BIC) [Sch78]. The reason for this approximation will be given in Section 3.1.3.
2.5 Possible approximations of Kolmogorov complexity
Now, even with these upper bounds, in practice, it is difficult to find good programs for the data. As an introduction to the next chapter, we give some heuristics:
• Usual compression techniques (like zip) can yield good results (as in [CV], for example).
• Minimum description length techniques ([Ris78], [Gru07] for example): starting from naive generative models, we can obtain more complex ones. For example, if we have to predict sequences of 0 and 1, and we initially have two experts, one always predicting 0 and the other always predicting 1, we can use a mixture of experts (use the first expert with probability p, and the second with probability 1 − p), and we can obtain all Bernoulli distributions. More interestingly, we can also automatically obtain strategies of the form "after having observed x_i, use expert k to predict x_{i+1}", thus obtaining Markov models. (A small sketch of such a mixture is given after this list.)
• Auto-encoders can also be used: they are hourglass-shaped neural networks (fewer nodes in the intermediate layer), trained to output exactly their input. In that case, the intermediate layer is a compressed form of the data, and the encoder and decoder are given by the network.
• The model class of Turing machines is very large. For example, if we restrict ourselves to finite automata, we can compute the corresponding restricted Kolmogorov complexity, and if we restrict ourselves to visible Markov models, we can even use Kolmogorov complexity for prediction.
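As announced in the minimum-description-length item above, here is a toy Bayesian mixture of Bernoulli "experts" predicting the next bit of a sequence online; the grid of experts and the uniform prior are illustrative choices, not from the thesis:

```python
# Online prediction by a Bayesian mixture of Bernoulli experts.
GRID = [i / 10 for i in range(1, 10)]       # experts: Bernoulli(theta)

def predict_next(bits):
    w = {t: 1 / len(GRID) for t in GRID}    # uniform prior over experts
    for b in bits:                          # Bayesian update after each bit
        for t in GRID:
            w[t] *= t if b == 1 else 1 - t
        z = sum(w.values())
        w = {t: v / z for t, v in w.items()}
    return sum(t * v for t, v in w.items()) # mixture probability of a 1

print(predict_next([0, 0, 0, 0, 0, 0, 0]))  # about 0.16: predicts 0, like 0000000?
print(predict_next([1, 0, 1, 1, 0, 1]))     # around 0.6
```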
Chapter 3
Universal probability distributions
The equivalence between coding and probability distributions, combined with Occam's razor, is probably the main reason for defining Kolmogorov complexity: Kolmogorov complexity gives the length of the "best" coding (according to Occam's razor), so the probability distribution it defines must be the "best" predictor or generative model. More precisely, we can define the following probability distributions on X* (all finite sequences of elements of X; these distributions are defined up to a normalization constant because of programs that do not end), which should be usable for any problem [Sol64]:
$$P_1(x) = 2^{-K(x)}, \qquad (3.1)$$
$$P_2(x) = \sum_{\text{all deterministic programs } p} 2^{-|p|}\,\mathbf{1}_{p\ \text{outputs}\ x}, \qquad (3.2)$$
$$P_3(x) = \sum_{\text{all random programs } p} 2^{-|p|}\,P(p\ \text{outputs}\ x), \qquad (3.3)$$
$$P_4(x) = \sum_{\text{probability distributions } \mu} 2^{-|\mu|}\,\mu(x), \qquad (3.4)$$
where a program is any string of bits to be read by a universal Turing machine, a random program is a program that has access to a stream of random bits (in particular, a deterministic program is a random program), and |µ| is the Kolmogorov complexity of µ, i.e., the length of the shortest program computing µ (if µ is not computable, then its Kolmogorov complexity is infinite, and µ does not contribute to P₄: the sum is in practice restricted to computable probability distributions). For example, P₂ is the output of a program written at random (the bits composing the program are random, but the program is deterministic), and P₃ is the output of a random program written at random.
Notice that P₁(x) can be rewritten as $\sum_{\text{Diracs }\delta} 2^{-|\delta|}\delta(x)$: P₄ is to P₁ what P₃ is to P₂ (the Diracs are the "deterministic probability distributions").
Since Kolmogorov complexity is defined only up to an additive constant, the probability distributions above are only defined up to a multiplicative constant. This leads us to the following definition:
Definition 3.1. Let P and P′ be two probability distributions on a set X. P and P′ are said to be equivalent if there exist two constants m and M such that:
$$mP \leq P' \leq MP. \qquad (3.5)$$
Proposition 3.2 (proved in [ZL70]). P₁, P₂, P₃ and P₄ are equivalent.

Consequently, we can pick any of these probability distributions (which are called the Solomonoff universal prior) as a predictor. We choose P₄:
$$P_4(x_{k+1}|x_1, ..., x_k) := \frac{\sum_\mu 2^{-|\mu|}\mu(x_1, ..., x_{k+1})}{\sum_{\mu, x'} 2^{-|\mu|}\mu(x_1, ..., x_k, x')} \qquad (3.6)$$
$$= \frac{\sum_\mu 2^{-|\mu|}\mu(x_1, ..., x_k)\,\mu(x_{k+1}|x_1, ..., x_k)}{\sum_\mu 2^{-|\mu|}\mu(x_1, ..., x_k)} \qquad (3.7)$$
$$= \frac{\sum_\mu w_\mu\,\mu(x_{k+1}|x_1, ..., x_k)}{\sum_\mu w_\mu}, \qquad (3.8)$$
where $w_\mu = 2^{-|\mu|}\mu(x_1, ..., x_k)$ can be seen as a Bayesian posterior on the probability distributions (we had the prior $2^{-|\mu|}$). In particular, the posterior weights depend on the data. However, as we have seen, the Kolmogorov complexity, and therefore P₄, are not computable. We have to replace Kolmogorov complexity by something simpler: we choose a family of probability distributions F, and restrict ourselves to this family. The distribution we obtain is therefore $P_F := \sum_{\mu\in F} 2^{-|\mu|}\mu$. We now give some possible families:
• F = {µ}. One simple model that explains past data can be good enough, and the computation cost is small (depending on µ).
• F = {µ1, ..., µn}: as seen in equation (3.8), we obtain a Bayesian combination.
• F = {µθ, θ ∈ Θ}: this case will be studied below. As a simple example, we can take Θ = [0, 1], and µθ is Bernoulli, with parameter θ.
In that case, we have
$$P_F = \sum_{\theta\in\Theta} 2^{-K(\theta)}\mu_\theta. \qquad (3.9)$$
Notice that the right side of equation (3.9) is only a countable sum: if θ is not computable, then K(θ) = +∞, and the corresponding term does not contribute to the sum.
There exist different techniques for approximating P_F in the parametric case (see for example [Gru07]):
1. Encoding the best θ₀ to encode the data. In that case, $\sum_{\theta\in\Theta} 2^{-K(\theta)}\mu_\theta$ is approximated by the largest term of the sum. Since we encode first a θ ∈ Θ, and then the data with the probability distribution P_θ, these codes are called two-part codes.
2. We can replace the penalization for a complex θ by a continuous Bayesian prior q on Θ. In that case, $\sum_{\theta\in\Theta} 2^{-K(\theta)}\mu_\theta$ is approximated by
$$\int_\Theta q(\theta)\,\mu_\theta. \qquad (3.10)$$
As we will see later, there exists a prior (called Jeffreys' prior) with good theoretical properties with respect to this construction.
3. Normalized maximum likelihood techniques (not discussed here).
4. We can make online predictions in the following way: use a default prior for x₁, and to predict (or encode) x_k, use past data to choose the best µ_θ [1]. With this method, the parameter θ is defined implicitly in the data: there is no need to use bits to describe it.
We also introduce the definition of prequential models [Daw84]. Although we will not use it directly, it will be an important inspiration for the definition of "experts" in Section 4.1:

Definition 3.3. A generative model P is said to be prequential (contraction of predictive-sequential, see [Gru07]) if $\sum_x P(x_1, ..., x_k, x) = P(x_1, ..., x_k)$ (i.e. different time horizons are compatible). A predictive model Q is said to be prequential if the equivalent generative model is prequential, or equivalently, if $Q(x_{k+1}, x_k|x_1, ..., x_{k-1}) = Q(x_{k+1}|x_1, ..., x_k)\,Q(x_k|x_1, ..., x_{k-1})$ (i.e. predicting two symbols simultaneously and predicting them one after the other are the same thing, hence the name "predictive-sequential").
[1] Here, "the best" does not necessarily mean the maximum likelihood estimator: if a symbol does not appear, then it will be predicted with probability 0, and we do not want this. A possible solution is to add fictional points of data before the message, which will also yield the prior on x₀. For example, when encoding the results of a game of heads or tails, it is possible to add before the first point a head and a tail, each with weight 1/2.

It can be checked that the Bayesian model and the online model are prequential, whereas the two others are not. Notice that prequential models really correspond to one probability distribution on Xⁿ. The reason why two-part codes and NML codes are not prequential is that for any fixed k, they correspond to a probability distribution P_k on the set of sequences of symbols of length k, but for k ≠ k′, P_k and P_{k′} are not necessarily related. We now study the first of these techniques: two-part codes.
3.1 Two-part codes
Our strategy is to use the best code P_{θ₀} in our family (P_θ). Since the decoder cannot know which P_θ we are using to encode our data, we need to send θ₀ first. Since θ₀ is a real parameter, we would almost surely need an infinite number of bits to encode it exactly: we will encode some θ "close" to θ₀ instead. Suppose for example that θ₀ ∈ [0, 1]. If we use only its first k binary digits for θ, then we have |θ − θ₀| ≤ 2^{−k}. Consequently, let us define the precision of the encoding of some θ ∈ [0, 1] as ε := 2^{−(number of bits used to encode θ)} (so we immediately have the bound |θ − θ₀| ≤ ε). We recall the bound (2.9): for any probability distribution µ,
$$K(x) \leq K(\mu) - \log_2(\mu(x)), \qquad (3.11)$$
which corresponds to coding µ, and then using an optimal code with respect to µ to encode the data. Here, increasing the precision (or equivalently, reducing ε) increases the likelihood of the data (and consequently −log₂(µ(x)) decreases), but K(µ) increases: we can suppose that there exists some optimal precision ε*. Let us compute it.
3.1.1 Optimal precision
In the ideal case (infinite precision), we encode the data x1, ..., xn by sending ∗ θ := argmaxθµθ(x1, ..., xk), and then, the data encoded with the probability distribution µθ∗ . The codelength corresponding to a given ε is:
l(ε) := − log2 ε − log2(µθ(x)), (3.12)
∗ where |θ − θ | 6 ε (i.e. we encode θ, which takes − log2 ε bits, and then, we encode x using µθ, which takes − log2(µθ(x)) bits).
34 With a second order Taylor expansion around θ∗, we find:
2 ∗ 2 ∂(− log2 µθ(x)) ∗ ∂ (θ − θ ) ∗ 2 l(ε) = − log ε−log µ ∗ (x)+ (θ−θ )+ (− log µ (x)) +o((θ−θ ) ), 2 2 θ ∂θ ∂θ2 2 θ 2 (3.13) where the derivatives are taken at θ = θ∗. The first order term is equal to zero, since by definition, µθ∗ is the ∗ probability distribution minimizing − log2 µθ(x). If we approximate θ − θ ∗ ∂2 by ε and write J(θ ) := ∂θ2 (− ln µθ(x)) (which is positive), we find: J(θ∗) ε2 l(ε) ≈ − log (µ ∗ (x)) − log ε + . (3.14) 2 θ 2 ln 2 2 dl 1 1 ∗ Differentiating with respect to ε, we find dε = ln 2 − ε + εJ(θ ) . If we ∗ dl denote by ε the optimal precision, we must have dε |ε=ε∗ = 0, i.e. s 1 ε∗ ≈ , (3.15) J(θ∗) which, by plugging (3.15) into (3.14), yields the following codelength:
$$l(\varepsilon^*) \approx -\log_2 \mu_{\theta^*}(x) + \frac{1}{2}\log_2 J(\theta^*) + \text{cst}. \qquad (3.16)$$
Essentially, the idea is that if $|\theta - \theta^*| < \varepsilon^*$, the difference between $P_\theta(x)$ and $P_{\theta^*}(x)$ is not significant enough to justify using more bits to improve the precision of $\theta$. Another way to look at this is that we cannot distinguish $\theta$ from $\theta^*$, in the sense that it is hard to tell whether the data have been sampled from one distribution or the other.
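This can be checked numerically. In the following sketch (ours; the Bernoulli setting and all values are arbitrary), the exact two-part codelength $l(\varepsilon)$ is minimized over a grid, and the minimizer indeed lands near $1/\sqrt{J(\theta^*)}$:

import math

n, k = 1000, 300       # n coin flips, k heads (arbitrary)
theta_star = k / n     # maximum likelihood estimate

def nll(theta):
    # -log2 mu_theta(x) for k heads among n flips
    return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

def codelength(eps):
    # -log2(eps) bits to send theta, then the data with the worst theta at distance eps
    return -math.log2(eps) + nll(theta_star + eps)

# J(theta*) = d^2/dtheta^2 (-ln mu_theta(x)) = k/theta^2 + (n-k)/(1-theta)^2 here
J = k / theta_star**2 + (n - k) / (1 - theta_star)**2
predicted = 1 / math.sqrt(J)

grid = [i * 1e-5 for i in range(1, 5000)]
best = min(grid, key=codelength)
print(predicted, best)   # both near 0.0145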
3.1.2 The i.i.d. case: confidence intervals and Fisher information
Let us now consider the i.i.d. case: all the $x_i$ are sampled from the same probability distribution $\alpha_\theta$, and we have $-\log_2 \mu_\theta(x) = \sum_i -\log_2 \alpha_\theta(x_i)$. In that case, $J(\theta) = \sum_{i=1}^n \frac{\partial^2}{\partial \theta^2}(-\ln \alpha_\theta(x_i))$, so it is roughly proportional to $n$, and $\varepsilon^*$ is therefore proportional to $\frac{1}{\sqrt{n}}$: we find the classical confidence interval. Moreover, if the $x_i$ really follow $\alpha_\theta$, then
$$\mathbb{E}_{x \sim \alpha_\theta}(J(\theta)) = n\, \mathbb{E}_{x \sim \alpha_\theta}\left(\frac{\partial^2}{\partial \theta^2}(-\ln \alpha_\theta(x))\right) =: n I(\theta), \qquad (3.17)$$
where $I(\theta)$ is the so-called Fisher information [Fis22]. We will often use the following approximation:
$$J(\theta) \approx n I(\theta). \qquad (3.18)$$
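The approximation (3.18) is easy to test by simulation. Here is a small Monte Carlo check (ours, with arbitrary values) for the Bernoulli case, where $\frac{\partial^2}{\partial\theta^2}(-\ln \alpha_\theta(x))$ equals $1/\theta^2$ for $x = 1$ and $1/(1-\theta)^2$ for $x = 0$:

import random

random.seed(0)
theta, n, trials = 0.3, 1000, 200

def observed_info(xs, theta):
    # J(theta) = sum_i d^2/dtheta^2 (-ln alpha_theta(x_i))
    return sum(1 / theta**2 if x else 1 / (1 - theta)**2 for x in xs)

avg_J = sum(observed_info([random.random() < theta for _ in range(n)], theta)
            for _ in range(trials)) / trials
print(avg_J, n / (theta * (1 - theta)))   # both close to n*I(theta) = 4761.9...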
With this, we can rewrite equation (3.15) for the i.i.d. case, and we find that the optimal $\varepsilon$ is approximately equal to:
$$\varepsilon^* \approx \frac{1}{\sqrt{n}} \frac{1}{\sqrt{I(\theta^*)}}. \qquad (3.19)$$
By simply studying the optimal precision to use for a parameter from a coding perspective, we managed to recover confidence intervals and the Fisher information.
3.1.3 Link with model selection

Now that we have the optimal precision and the corresponding codelength, we can also solve certain model selection problems. Suppose for example that you are playing heads or tails $n$ times, but in the middle of the game, the coin (i.e. the parameter of the Bernoulli distribution) is changed. You are given the choice to encode either a single $\theta$, or a $\theta_1$ which will be used for the first half of the data and a $\theta_2$ which will be used for the second half. Let us compute the codelengths corresponding to these two models. If we denote by $x$ all the data, by $x^1$ the first half of the data, and by $x^2$ the second half of the data, we find, by combining equations (3.16) and (3.18):
$$l_1 = -\log_2 \mu_\theta(x) + \frac{1}{2}\log_2 I(\theta) + \frac{1}{2}\log_2(n) + \text{cst}, \qquad (3.20)$$
$$l_2 = -\log_2 \mu_{\theta_1}(x^1) - \log_2 \mu_{\theta_2}(x^2) + \frac{1}{2}\log_2 I(\theta_1) + \frac{1}{2}\log_2 I(\theta_2) + \frac{1}{2}\log_2(n/2) + \frac{1}{2}\log_2(n/2) + \text{cst} \qquad (3.21)$$
$$= -\log_2 \mu_{\theta_1, \theta_2}(x) + \frac{2}{2}\log_2(n) + O(1). \qquad (3.22)$$
It is easy to see that for a model with $k$ parameters, we would have:
$$l_k = -\log_2 \mu_{\theta_1, ..., \theta_k}(x) + \frac{k}{2}\log_2(n) + O(1). \qquad (3.23)$$
Asymptotically, we obtain the Bayesian Information Criterion, which is often used for model selection. It could be interesting to use the non-asymptotic equation (3.21) instead, but the Fisher information is usually hard to compute.
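As an illustration (ours, with arbitrary parameter values), the following sketch applies the $\frac{k}{2}\log_2(n)$ penalty of (3.23) to simulated coin flips whose bias changes halfway through, and the two-parameter model wins as expected:

import math, random

random.seed(1)
n = 2000
flips = [random.random() < 0.3 for _ in range(n // 2)] + \
        [random.random() < 0.7 for _ in range(n - n // 2)]

def nll(xs):
    # -log2 mu_theta_hat(xs), with theta_hat the ML estimate on xs
    k, m = sum(xs), len(xs)
    if k in (0, m):
        return 0.0
    t = k / m
    return -(k * math.log2(t) + (m - k) * math.log2(1 - t))

l1 = nll(flips) + 0.5 * math.log2(n)                              # one theta
l2 = nll(flips[:n//2]) + nll(flips[n//2:]) + 1.0 * math.log2(n)   # two thetas
print(l1, l2)   # l2 is smaller by a couple hundred bits here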
It is also interesting to notice that using two-part codes automatically makes the corresponding coding suboptimal, since it reserves several codewords for the same symbol: coding $\theta_1$ followed by $x$ coded with $P_{\theta_1}$ yields a different code than $\theta_2$ followed by $x$ coded with $P_{\theta_2}$, but both are codes for $x$. A solution to this problem is to set $P_\theta(x) = 0$ whenever there exists $\theta'$ such that $P_{\theta'}(x) > P_\theta(x)$, and to renormalize: then, for a given $x$, only the best estimator can be used to encode $x$. This yields the normalized maximum likelihood (NML) distribution: if we denote by $\hat{\theta}(x)$ the maximum likelihood estimator for $x$,
$$\mathrm{NML}(x) = \frac{P_{\hat{\theta}(x)}(x)}{\sum_{x'} P_{\hat{\theta}(x')}(x')}.$$

Another way to look at this problem is the following: consider as an example the simple case $\Theta = \{1, 2\}$. The encoding of $\theta$ corresponds to a prior $q$ on $\Theta$ (for example, using one bit to distinguish $P_1$ from $P_2$ corresponds to the uniform prior $q(1) = q(2) = 0.5$). The two-part code corresponds to using $\max(q(1)P_1, q(2)P_2)$ as our "probability distribution" to encode the data, but its integral is not equal to 1: we lose $\int_X \min(q(1)P_1(x), q(2)P_2(x))\,dx$, which makes the codelengths longer. Consequently, it is more interesting to directly use the mixture $q(1)P_1 + q(2)P_2$ to encode the data when possible, because all codewords will then be shorter. We are thus naturally led to Bayesian models.
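Returning to the NML distribution defined above: for small $n$ it can be computed exhaustively. Here is a minimal sketch (ours) for Bernoulli sequences of length 10, checking that the renormalization indeed yields a probability distribution:

from itertools import product

def ml_prob(xs):
    # P_{theta_hat(xs)}(xs): likelihood of xs under its own ML estimate
    k, m = sum(xs), len(xs)
    t = k / m
    return (t ** k) * ((1 - t) ** (m - k))   # 0**0 == 1 handles k in {0, m}

n = 10
seqs = list(product([0, 1], repeat=n))
Z = sum(ml_prob(xs) for xs in seqs)          # sum over x of P_theta_hat(x)(x)
nml = {xs: ml_prob(xs) / Z for xs in seqs}
print(sum(nml.values()))                     # 1.0
print(nml[(1,) * n], nml[(1, 0) * (n // 2)]) # extreme vs. balanced sequence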
3.2 Bayesian models, Jeffreys’ prior
3.2.1 Motivation

We need a reasonable default prior $q$ for the probability distribution (3.10):
$$P_F^{\text{Bayes}}(x) = \int_\Theta q(\theta)\mu_\theta(x)\,d\theta. \qquad (3.24)$$
A naive choice for a default prior is "the uniform prior" (i.e. $q(\theta) = \text{cst}$). Sadly, the uniform prior on a family of probability distributions can be ill-defined. Consider for example the family $\mathcal{B}$ of Bernoulli distributions. If $P_\theta$ is the Bernoulli distribution of parameter $\theta$, and $Q_\theta$ is the Bernoulli distribution of parameter $\theta^{100}$, then $\{P_\theta, \theta \in [0, 1]\} = \{Q_\theta, \theta \in [0, 1]\} = \mathcal{B}$, but most of the time, the uniform prior on the family $(Q_\theta)$ will select a Bernoulli distribution with a parameter close to 0 ($\theta$ is picked uniformly in $[0, 1]$, so $\theta^{100} < 0.1$ with probability $0.1^{1/100} \gtrsim 0.97$).

This shows that a uniform prior depends not only on our family of probability distributions, but also on its parametrization, which is an arbitrary choice of the user. Any reasonable default prior should be invariant under reparametrization of the family $\{P_\theta, \theta \in \Theta\}$ (otherwise, there would be as many possible priors as there are parametrizations), and Jeffreys' prior [Jef61], constructed below, does have this property.
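The $\theta^{100}$ computation above is easy to check by simulation; the following two-line sketch is our own illustration:

import random

random.seed(2)
draws = [random.random() ** 100 for _ in range(10**5)]   # theta uniform, parameter theta^100
print(sum(d < 0.1 for d in draws) / len(draws))          # about 0.977
print(0.1 ** (1 / 100))                                  # 0.977..., the exact value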
3.2.2 Construction

Suppose we are sending a message of length $n$ with a two-part code. Equation (3.19) (about the optimal coding precision) shows that when coding a given message with a two-part code, only a finite number of elements of $\Theta$ will actually be used in the first part of the coding, each $\theta_k$ being used for all $\theta$ in an interval $I_k$; and it also shows that the length of $I_k$ is $\frac{1}{\sqrt{n}}\frac{1}{\sqrt{I(\theta_k)}}$.

The fact that all elements of $I_k$ are encoded the same way means that we will not distinguish different $\theta$ in the same $I_k$, and in practice, we will only use the $m$ different $\theta_k$ corresponding to each interval. Consequently, a reasonable procedure to pick a $\theta$ would be to start by picking some $k \in \{1, ..., m\}$, and then pick $\theta \in I_k$ uniformly. It is easy to see that the probability distribution $q_n$ corresponding to this procedure is given by the density:
$$q_n(\theta) = K_n \sqrt{I(\theta_{k(\theta)})}, \qquad (3.25)$$
where $K_n$ is a normalization constant, and $k(\theta)$ is defined by the relation $\theta \in I_{k(\theta)}$.

Now, if $n \to \infty$, it can be proved that $(q_n)$ converges$^2$ to the probability distribution $q$ given by the density:
$$q(\theta) = K\sqrt{I(\theta)}, \qquad (3.26)$$
where $K$ is a normalization constant.$^3$ Also notice that sometimes (for example with the family of Gaussian distributions on $\mathbb{R}$), Jeffreys' prior cannot be normalized, and consequently, cannot be used.
We now have a reasonable default prior, and we are going to use it to predict the next element of a sequence of zeros and ones.
3.2.3 Example: the Krichevsky–Trofimov estimator

We introduce the following notation:
Notation 3.4. We denote by $P_\theta$ the Bernoulli distribution of parameter $\theta$, and we define $\mathcal{B} := \{P_\theta, \theta \in [0, 1]\}$.

Suppose we are trying to learn a frequency on the alphabet $X = \{c, d\}$. For example, given the message $ccc$, we want $P(x_4 = c)$. A first approach is to use the maximum likelihood (ML) estimator:
$$P^{\mathrm{ML}}(x_{n+1} = c \mid x_1, ..., x_n) = \frac{\sum_{i=1}^n \mathbb{1}_{x_i = c}}{n} = \frac{\text{number of } c \text{ in the past}}{\text{number of observations}}. \qquad (3.27)$$
$^2$The important idea is that $\theta_{k(\theta)} \to \theta$ when $n \to \infty$, since the length of all the intervals $I_k$ tends to 0.
$^3$In dimension larger than 1, (3.26) becomes $q(\theta) = K\sqrt{\det I(\theta)}$.
This has two drawbacks: ML assigns probability 0 to any letter that has not been seen yet (a major problem when designing codes, since probability 0 corresponds to infinite codelength), and it is undefined for the first letter.
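The first drawback can be made concrete in two lines (our illustration): after observing ccc, ML leaves no codeword at all for d.

import math

past = "ccc"
p_c = past.count("c") / len(past)   # ML estimate: 1.0
p_d = 1 - p_c                       # 0.0
print(-math.log2(p_d) if p_d > 0 else float("inf"))   # inf: d cannot be encoded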
Let us now consider the Bayesian approach with Jeffreys' prior.

Lemma 3.5. The Fisher metric on $\mathcal{B}$ is given by
$$I(\theta) = \frac{1}{\theta(1-\theta)}, \qquad (3.28)$$
and Jeffreys' prior is given by
$$q(\theta) = \frac{1}{\pi} \frac{1}{\sqrt{\theta(1-\theta)}}. \qquad (3.29)$$
Proof. It is a straightforward calculation. We have $P_\theta(0) = 1 - \theta$ and $P_\theta(1) = \theta$, so $\frac{\partial^2 \ln P_\theta(0)}{\partial \theta^2} = \frac{-1}{(1-\theta)^2}$ and $\frac{\partial^2 \ln P_\theta(1)}{\partial \theta^2} = \frac{-1}{\theta^2}$. Finally:
$$I(\theta) = -\frac{\partial^2 \ln P_\theta(0)}{\partial \theta^2} P_\theta(0) - \frac{\partial^2 \ln P_\theta(1)}{\partial \theta^2} P_\theta(1) \qquad (3.30)$$
$$= \frac{1}{(1-\theta)^2}(1-\theta) + \frac{1}{\theta^2}\theta \qquad (3.31)$$
$$= \frac{1}{1-\theta} + \frac{1}{\theta} \qquad (3.32)$$
$$= \frac{1}{\theta(1-\theta)}, \qquad (3.33)$$
and we have the Fisher information. The only difficulty in computing Jeffreys' prior is the normalization constant, which is
$$K := \int_0^1 \frac{d\theta}{\sqrt{\theta(1-\theta)}}, \qquad (3.34)$$
and the substitution $\theta = \sin^2 u$ yields
$$K = \int_0^{\pi/2} \frac{2\sin u \cos u \, du}{\sqrt{\sin^2 u \cos^2 u}} = \pi. \qquad (3.35)$$
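The normalization constant can also be verified numerically; here is a short check (ours, using scipy, whose quadrature handles the integrable endpoint singularities):

import math
from scipy.integrate import quad

K, _ = quad(lambda t: 1 / math.sqrt(t * (1 - t)), 0, 1)
print(K, math.pi)   # both approximately 3.14159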
We therefore have, by using Jeffreys' prior in (3.10):
$$P^{\text{Jeffreys}}(x_{n+1} \mid x_1, ..., x_n) = \frac{1}{\pi} \int_0^1 \frac{1}{\sqrt{\theta(1-\theta)}} P_\theta(x_1, ..., x_n) P_\theta(x_{n+1} \mid x_1, ..., x_n) \, d\theta. \qquad (3.36)$$
This is actually easy to compute, thanks to the following proposition (proved for example in [Fin97]):
Proposition 3.6. Let $\alpha, \beta > 0$. If $q(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$, then
$$P^{\text{Bayes}}(x_{n+1} = c \mid x_1, ..., x_n) = \frac{\alpha + \sum_{i=1}^n \mathbb{1}_{x_i = c}}{\alpha + \beta + n}, \qquad (3.37)$$
where
$$P^{\text{Bayes}}(x) = \int_\Theta q(\theta) B_\theta(x) \, d\theta \qquad (3.38)$$
is the Bayesian prediction with Bernoulli distributions and the prior $q$ on the parameter $\theta$ of the distributions.

In other words, using such a prior is equivalent to using ML with fictional observations ($c$ occurring $\alpha$ times and $d$ occurring $\beta$ times).$^4$
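A minimal implementation of this pseudo-count rule (our sketch; with $\alpha = \beta = \frac{1}{2}$ it gives exactly the Krichevsky–Trofimov estimator introduced just below):

def bayes_predict(past, alpha=0.5, beta=0.5):
    # P(next symbol = 'c' | past) under a prior proportional to theta^(alpha-1) * (1-theta)^(beta-1)
    n_c = past.count("c")
    return (alpha + n_c) / (alpha + beta + len(past))

print(bayes_predict(""))      # 0.5: defined even before any observation
print(bayes_predict("ccc"))   # 0.875 = (1/2 + 3)/(1 + 3): no zero probabilities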
In particular, for Jeffreys' prior, we have $\alpha = \beta = \frac{1}{2}$, so
$$P^{\text{Jeffreys}}(x_{n+1} = c \mid x_1, ..., x_n) = \frac{\frac{1}{2} + \sum_{i=1}^n \mathbb{1}_{x_i = c}}{n+1} = \frac{\frac{1}{2} + \text{number of } c \text{ in the past}}{1 + \text{number of observations}}. \qquad (3.39)$$
Consequently, we introduce the following notation: $\mathrm{KT}(x, y) := \frac{\frac{1}{2} + x}{1 + x + y}$. This estimator is called the "Krichevsky–Trofimov estimator" (KT, [KT81]), and it can be proved ([WST95], [VNHB11]) that, when using it instead of knowing the "real" $\theta$, no more than $1 + \frac{1}{2}\log(n)$ bits are wasted.

We can also define the generative model corresponding to Jeffreys' prior. It is easy to prove (by induction on the total number of symbols) that the probability assigned to any given sequence $(x_1, ..., x_{a+b})$ with $a$ zeros and $b$ ones is equal to