
The Foundations of Solomonoff Prediction

MSc Thesis for the graduate programme in History and Philosophy of Science at the Universiteit Utrecht

by Tom Florian Sterkenburg

under the supervision of prof.dr. D.G.B.J. Dieks (Institute for History and Foundations of Science, Universiteit Utrecht) and prof.dr. P.D. Grünwald (Centrum Wiskunde & Informatica; Universiteit Leiden)

February 2013

Parsifal: Wer ist der Gral?

Gurnemanz: Das sagt sich nicht; doch bist du selbst zu ihm erkoren, bleibt dir die Kunde unverloren. – Und sieh! – Mich dünkt, daß ich dich recht erkannt: kein Weg führt zu ihm durch das Land, und niemand könnte ihn beschreiten, den er nicht selber möcht' geleiten.

Parsifal: Ich schreite kaum, – doch wähn' ich mich schon weit.

Richard Wagner, Parsifal, Act I, Scene I

For mum


Abstract

R.J. Solomonoff's theory of Prediction assembles notions from information theory, confirmation theory and computability theory into the specification of a supposedly all-encompassing objective method of prediction. The theory has been the subject of both general neglect and occasional passionate promotion, but of very little serious philosophical reflection. This thesis presents an attempt towards a more balanced philosophical appraisal of Solomonoff's theory. Following an in-depth treatment of the mathematical framework and its motivation, I shift attention to the proper interpretation of these formal results. A discussion of the theory's possible aims turns into the project of identifying its core principles, and a defence of the primacy of the unifying principle of Universality supports the development of my proposed interpretation of Solomonoff Prediction as the statement, to be read in the context of the philosophical problem of prediction, that in a universal setting, there exist universal predictors. The universality of the setting is grounded in the central assumption of computability: while this assumption is not uncontroversial as a constraint on the world, I argue that it is hardly a constraint at all if we restrict attention to all possible competing prediction methods. This is supported by a new, more refined convergence result.


Acknowledgements

Many thanks to Dennis and Peter for supervising a thesis that would have been quite unusual to both. If it wasn't for your enthusiastic reception of my initial plan, I wouldn't have pushed it. Thanks for answering many questions and for good discussion. Thanks also to Paul Vitányi for answering many questions and for good discussion. Thanks to George Barmpalias, Adam Day, Marcus Hutter, Panu Raatikainen, Jan-Willem Romeijn and Theo Kuipers for answering questions and for valuable suggestions. Thanks to fellow students Nick, Fedde and Abram for valuable discussion. Finally, thanks to my little sister for many things.

Tom Sterkenburg Amsterdam, February 2013


Contents

Acknowledgements ix

Introduction 1
    Some Context and Sources 2
    The Plan of This Thesis 4

0 Warming Up 7
    The MDL Principle 7
    Solomonoff's Theory of Prediction 8
    A Different View of Solomonoff Prediction 9

1 Solomonoff's Theory of Prediction 11
    1.1 The Setting 11
        1.1.1 Inspiration from Information 11
        1.1.2 Plugging in Probabilities 13
        1.1.3 Connecting to Computability 19
    1.2 Algorithmic Probability 22
        1.2.1 A First Definition 22
        1.2.2 A Second Definition 24
        1.2.3 The Third Definition: Algorithmic Probability 28
    1.3 The Universal Prior Distribution 30
        1.3.1 A Universal Mixture Distribution 31
        1.3.2 Algorithmic Probability and The Universal Mixture 33
    1.4 Universal Prediction 38
        1.4.1 Prediction with Solomonoff 38
        1.4.2 Completeness 39

2 The Principles of Solomonoff Prediction 45
    2.1 The Purpose of Solomonoff Prediction 45
        2.1.1 Method, Model and Theory 46
        2.1.2 The Method of Solomonoff Prediction 48
        2.1.3 The Model of Solomonoff Prediction 50
        2.1.4 The Theory of Solomonoff Prediction 51
    2.2 The Candidate Core Principles 54
        2.2.1 Identification of the Candidates 54
        2.2.2 Completeness 56
        2.2.3 Simplicity 56
        2.2.4 Universality 59

3 Universality of Solomonoff Prediction 63
    3.1 The Encapsulation of Completeness 63
        3.1.1 Making the Method Work 63
        3.1.2 Making the Theory Work 66
    3.2 The Threat of Subjectivity 67
        3.2.1 The Threat 68
        3.2.2 Taking the Threat Away 69
    3.3 The Questionable Role of Simplicity 71
        3.3.1 The Short Descriptions of Algorithmic Probability 71
        3.3.2 The Weights of the Universal Mixture Distribution 78
        3.3.3 Conclusion 81

4 Universality of the Model 83
    4.1 Encodings 83
        4.1.1 Data and Binary Sequences 83
        4.1.2 The Language of Binary Sequences 85
    4.2 Environments 86
        4.2.1 Generality of the Probabilistic Environments 87
        4.2.2 Interpretation of Probabilities 89
    4.3 Effectiveness 91
        4.3.1 A Natural Restriction? 91
        4.3.2 Turing Computability in the Wider World 92
        4.3.3 Effectiveness in Solomonoff Prediction 95
    4.4 Predictors For Environments 98
        4.4.1 The Model of Predictors 98
        4.4.2 Universality of the Model of Predictors 101

Conclusion 105

Bibliography 109

Symbol Index 123

Name Index 125

Subject Index 127

Introduction

There are many engaging angles one can take to introduce a topic as profound as the topic of this thesis. (The skeptic – or simply the level-headed? – would perhaps rephrase: many ways of providing engaging context for a topic that is so theoretical as to appear altogether esoteric otherwise... The tone of this thesis will often be the tone of the skeptic, but let me start on a positive note.)

One such perspective stresses the topic's deep roots in the theoretical foundation of computation and information, and consequently its natural connection to the characteristic ideas and developments of the contemporary digital age. The aim of the Alan Turing Year 2012 was not only to commemorate Turing's inception of computer science, as a feat of mathematical logic of the greatest theoretical importance, but also to reflect on the all-pervasive role, little over half a century later, of computers and computation in virtually every aspect of our lives and of society as a whole – including, of course, the practice of science. Sadly, the writing of this thesis took some time, and in the end it has missed its chance to join in the celebrations of the Turing year...

Another vivid connection to current issues in science is established via the important role of statistical inference. A number of recent controversies, including some notable cases in our own country, have illustrated the need for a continuing look at both the practical application, and, crucially, the theoretical foundations of statistical theory. In fact, in the shadow of the main opposing schools of frequentism and Bayesianism, several less well-known attempts have been undertaken to come to a different and less problematic basis. An example of such an attempt is J.J. Rissanen's principle of Minimum Description Length (MDL). The defining idea is to get rid of the necessity of any probabilistic assumptions on the world by reformulating the aim of statistical inference in terms of data compression.

A main source of inspiration for the development of the MDL principle has been the 1960s theory of prediction by R.J. Solomonoff. This theory presents a mathematical characterization of supposedly universally objective and optimal prediction. The MDL principle is regularly presented as a practical approximation of Solomonoff's idealized theory.

The precise (conceptual) relation between MDL and Solomonoff's theory of Prediction is an issue that itself merits much more attention. An important step in this direction, and more in general an initial step towards an investigation of the foundations of MDL, should be an investigation of the foundations of the theory of Solomonoff Prediction (SP). That investigation is the topic of this thesis.


Some Context and Sources

The following historical sketch is to give a first appreciation of the complex of ideas and efforts that are connected to the theory of Solomonoff. It also serves to introduce some of the main sources I relied on in the writing of this thesis.

Inception of the theory Raymond J. Solomonoff (1926–2009) was driven by the quest for a general method to learn or discover scientific facts, a project that relates to the formation of the field of artificial intelligence (AI) at the time. He belonged to the select group of attendees of the famous Dartmouth Summer Research Project on Artificial Intelligence of 1956, that gave the field its name. Here Solomonoff circulated a report on An Inductive Inference Machine, that he worked out into a talk at the 1956 IEEE Symposium on Information Theory, resulting in what could well be the very first scientific publication on machine learning [161]. Solomonoff properly initiated the theory that I refer to as Solomonoff Prediction in the 1960 paper A Preliminary Report on a General Theory of Inductive Inference, and further refined it four years later in Parts I and II of A Formal Theory of Inductive Inference [162, 163, 164].

Although Solomonoff's work was mentioned occasionally (for instance by M.L. Minsky in [122, 123, 124], attributing the theory "great philosophical importance" for inductive inference in the latter book), it didn't at first receive too much attention within the AI community, let alone outside it. Not much later than Solomonoff, A.N. Kolmogorov [105], the great Soviet mathematician, and also G.J. Chaitin [16, 17], independently developed a very much related idea. The independent discovery of this general idea by these three pioneers can be seen to mark the birth of the discipline of algorithmic information theory (AIT, also known as Kolmogorov complexity). Rather than inductive inference, Kolmogorov was interested in a characterization of randomness and (later) of information content, and this was the way the ensuing theory of Kolmogorov complexity was to go; but Kolmogorov started to refer to Solomonoff as soon as he became aware of the primacy of his work. Kolmogorov's student L.A. Levin, motivated by the perceived lack of mathematical rigor in Solomonoff's 1960s papers, set out to securely establish the mathematical setting, and further studied this setting for its own sake [191, 115]. In the meantime, Solomonoff proved the important convergence result of his theory, which he only published in 1978 [165].

Chaitin was originally concerned with program-size complexity, and later also took the basic idea in the direction of measures of randomness and information content. He has done more than anyone to popularize the field of algorithmic information theory via various articles and books (e.g., [18, 20, 22, 21]). However, he has been accused of painting a very idiosyncratic picture, "rewriting the history of the field" and downplaying or ignoring the work of others so as to present "himself as the sole inventor of its main concepts and results" [54]. The grand philosophical interpretations he adjoins to his work, as extending and deepening Gödel's incompleteness results and showing the fundamental randomness of mathematical truth, have been criticized as being highly misleading or simply false [178, 138].

Inspiration in statistics Although Rissanen wasn't aware of the work of Solomonoff at the time, the paper of Kolmogorov was an important influence when he wrote the first ([142], 1978) of the series of papers that instigated the principle of Minimum Description Length. Rissanen's ideas were also shaped by the Akaike Information Criterion (AIC, [1], 1973), a method of model selection that was the very first to be based on information-theoretic ideas. In his later work (e.g., [144]), Rissanen explicitly acknowledges Solomonoff's contributions. A closely related but independently developed idea is the principle of Minimum Message Length (MML, [183], 1968), developed still a decade earlier by C.S. Wallace without knowledge of the concurrent birth of algorithmic information theory.

Kolmogorov, in two talks at the end of his career (1973, [107, 106]), initiated a method of nonprobabilistic model selection based on Kolmogorov complexity that is now known as the Kolmogorov structure function approach (see [179, 181]). It has been presented (e.g., [67]) as a refinement of idealized MDL, the version of MDL that includes all computable hypotheses and is closest in spirit to Solomonoff Prediction.

The modern theory of MDL was first summarized in [5]. A clear and introductory overview is given in [41], that draws on the much more comprehensive presentation in [66], the first modern textbook on the subject.

Current status of the theory The original theory of Solomonoff again gained some publicity through the textbook Kolmogorov Complexity and Its Applications by M. Li and P.M.B. Vitányi (the first edition [118] appeared in 1993; the third and most recent edition [119] in 2008), which has come to be the standard reference in the area of Kolmogorov complexity. The book devotes a great deal of attention to SP itself, in a presentation that owes much to Levin's mathematical framework. The authors often spice things up with claims about its philosophical implications.

In the last decade, the theory has found a champion in M. Hutter, who took up the project of promoting the theory's wider relevance. He adheres for the most part to the modern presentation of Li and Vitányi (also see [111] by his student S. Legg for a more condensed overview of the theory along these lines), enlarging on the philosophical themes that were already present in their work. In addition to contributing a number of technical results about the theory, Hutter's main feat is the extension of the original theory of passive prediction to what he calls a theory of Universal Artificial Intelligence [75, 81, 113, 112, 85] in an active setting that can also accommodate an agent's actions. He couches his writings on both this extension and the original theory in philosophical assertions that, it's safe to say, don't quite attain the same level of rigour as the mathematics, and that are in any case sufficiently bold to make the unguarded philosopher quite uncomfortable. The actively sought-for association with other controversial ideas like the technological singularity1 doesn't help much to take away this unease.

1This idea, of a future intelligence explosion following artificial intelligence surpassing that of humans (promoted in a near religious context by futurist R. Kurzweil), was in fact also anticipated by Solomonoff in the 1985 text [166].


Motivated by the discrepancy between the grand philosophical claims of the mathematically inclined proponents of the theory (perhaps for the most part induced by the desire to popularize it), and the notable lack of interest from the philosophical community in the theory (perhaps due to its mathematically challenging nature), a principal aim of the current thesis is also to bridge this gap by a more careful philosophical appraisal of Solomonoff Prediction. A related subaim, which justifies the sometimes encyclopaedic nature of the thesis, is both to give a feeling for the rich context of Solomonoff's theory and to cleanse it of some commonly assumed affiliations with wildly speculative ideas.

The Plan of This Thesis

The ultimate goal of this work is to present a new way of looking at the theory of Solomonoff. It's my hope that this investigation and the resulting interpretation of SP will lead to a clearer picture of the conceptual relation to associated theories like MDL, and show in what way SP may contribute to the philosophy of prediction in general.

Warming Up To start off, I give in Chapter 0 a quick intuitive overview of both MDL and Solomonoff's theory. I show the conceptual connection between these two theories, as well as the conceptual difference between the standard way of presenting Solomonoff's theory and my interpretation. This Chapter is supposed to give a first impression of the content and bearing of the theory for those who are not familiar with the field and would rather steer clear of the technical details.

Solomonoff's Theory of Prediction Chapter 1 gives a detailed account of the formal content of the theory of SP. This includes the information-theoretic, probabilistic and effective background setting (Section 1.1), the definitions of algorithmic probability (Section 1.2) and the universal mixture distribution (which turn out to be equivalent, giving a universal prior distribution, Section 1.3), and the results about predictive performance with this universal predictor (Section 1.4).

The Principles of Solomonoff Prediction Having set out the formal substance of the theory of SP, in Chapter 2 I turn to the proper way of looking at the theory. This project I put in terms of the search for the core principles of the theory. After motivating this connection and introducing the interpretations of SP as a method, a model and a theory of prediction (Section 2.1), I identify and present the three candidate core principles of Completeness, Simplicity and Universality (Section 2.2).

Universality of Solomonoff Prediction The single principle I will ground my interpretation on is the principle of Universality. Chapter 3 is devoted to the universality of the predictor, which covers one half of what I take to be the main result of the theory of SP, that in a universal setting, there are universal predictors. In the course of developing this viewpoint, I dismiss the principle of Completeness (Section 3.1),

show how the threat of subjectivity can be overcome (Section 3.2), and how the role of Simplicity must be subsidiary to Universality (Section 3.3).

Universality of the Model The defence of my interpretation and the primacy of the principle of Universality is completed in Chapter 4 by a demonstration of the universality of the model of SP. This includes the unrestrictiveness of the information-theoretic (Section 4.1), probabilistic (Section 4.2) and effective (Section 4.3) setting. The last remaining obstacle in the case of effectiveness is overcome by interpreting the elements of the model as competing predictors rather than environments, which is supported by a refined convergence result (Section 4.4).


0 Warming Up

The purpose of this preliminary Chapter is to give a quick and non-technical impression of the conventional way of looking at Minimum Description Length and Solomonoff Prediction. The presentation of the latter also serves as contrast for the view I develop in this thesis.

The MDL Principle

Suppose that we are confronted with a body of data, and a number of competing hypotheses about the data. The challenge that we face is to pick out the hypothesis that explains the data best. The principle of Minimum Description Length (MDL) provides a criterion to decide between the hypotheses. The fundamental intuition that underlies this criterion is that understanding or explaining the data is to be identified with the ability to compress the data. Learning is a matter of finding regularities, and any regularity offers an opportunity to describe the data in a shorter way. Taking this lead, we should like to pick the hypothesis that compresses the data most. Actually, and more precisely, we should like to choose the hypothesis H such that the description of the data given H plus the description of H itself is as concise as possible. In short, given a class $\mathcal{H}$ of hypotheses and data D, the MDL principle advises us to select the hypothesis

$$\min_{H \in \mathcal{H}} \bigl( L(H) + L_H(D) \bigr), \qquad (0.1)$$

with $L(H)$ the description length of hypothesis $H$ and $L_H(D)$ the description length of data $D$ given $H$. The description length of a hypothesis or a piece of data has a natural interpretation as quantifying its simplicity. The MDL principle can thus be seen to give a precise guideline towards an optimal trade-off between simplicity and goodness of fit. An extremely simple hypothesis is not likely to be any good because it probably won't help us to significantly compress the data. On the other hand, the ad-hoc hypothesis that explicitly records our particular data, although it gives optimal compression of this data, will be no good because it is no less elaborate than the data itself. The best hypothesis will be found somewhere in between these two extremes.
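To make the criterion concrete, here is a minimal sketch (not from the thesis) of crude two-part MDL selection for a toy class of Bernoulli hypotheses. The data string, the parameter grid and the coding choices are hypothetical illustrations of (0.1): L(H) is taken to be a uniform code over the candidate parameters, and L_H(D) the Shannon-Fano code length −log₂ P(D | H).

```python
from math import log2, ceil

def two_part_code_length(data, p, n_candidates):
    """L(H) + L_H(D) for a Bernoulli(p) hypothesis H about a binary string D.

    L(H): a uniform code over the n_candidates parameter values.
    L_H(D): the Shannon-Fano code length -log2 P(D | H), rounded up.
    """
    ones = data.count('1')
    zeros = len(data) - ones
    if (p == 0.0 and ones) or (p == 1.0 and zeros):
        return float('inf')                      # H assigns the data probability zero
    L_hyp = ceil(log2(n_candidates))
    log_lik = ones * log2(p) + zeros * log2(1 - p) if 0 < p < 1 else 0.0
    return L_hyp + ceil(-log_lik)

data = '0001000100000010'                        # hypothetical data string
grid = [i / 10 for i in range(11)]               # candidate Bernoulli parameters
best = min(grid, key=lambda q: two_part_code_length(data, q, len(grid)))
print(best, two_part_code_length(data, best, len(grid)))
```

The two extremes of the text appear here as well: a very crude hypothesis barely compresses the data, while one tailored too closely to it buys little once its own description length is counted.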

Universal codes Of course, the sense of all this depends on the descriptions used. For the obvious complaint that one could bring against MDL is that it gives a criterion that is really subjective: if it’s entirely up to our fancy what descriptions we choose

to refer to given hypotheses, there is no basis for preferring the one with a shorter description. In order to circumvent this trivialization, MDL advocates a particular kind of description method or code. Namely, codes that for each possible object to be described mimic the optimal code for that object, the code that results in its shortest possible description length. Description methods that fit this profile are called universal codes, and the core business of the MDL enterprise is the design of such codes.

Solomonoff's Theory of Prediction

Universal codes pertain to a certain kind of objectivity, but are still relative to the class of hypotheses one chooses in advance. In idealized MDL, the class is extended to contain all computable hypotheses. This move is a defining element of Solomonoff's theory.

Idealized MDL The main idea is that the very shortest description of an object is the shortest computer program that generates it. This connects in a direct way to the central role of data compression. The shortest computer program for an object represents the maximally achievable compression of the object.

Following this line, the universal code of idealized MDL is an agreed-upon computer language, which we can picture in a very abstract way (as a universal Turing machine), or concretely as a familiar general-purpose programming language like Pascal or Python. Such a programming language is a truly universal code because for any other computable code we can come up with a program that implements this code. That means that if an object has a short description via some other code, it automatically has a description via our computer language that is not much longer: the only overhead is the program for the original code.

For the same reason it doesn't matter much which specific programming language we use, as long as it's a general-purpose (computationally universal or Turing-complete) language. Given any two programming languages, we can write a compiler in the first to execute programs written in the second language. Again, there is only a constant overhead. This points at a specific invariance with respect to the choice of computer language, which gives a formal basis for treating an object's shortest program in our computer language as the objectively shortest description of the object.

Of course, in practice, particularly in the case of few data, such differences do have an effect on our results. Worse, finding the very shortest programs for data objects is utterly infeasible. We can in fact prove that it's impossible in principle to compute the shortest program for any given object: there simply can't exist an algorithm that does this. Hence the name idealized MDL: a theory of theoretically optimal model selection that realistic versions of MDL can only approximate.
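The invariance alluded to here can be stated compactly. The following is a sketch in standard notation (introduced only for this illustration, not yet used in the text), where $K_U(\sigma)$ denotes the length of a shortest program that makes machine $U$ output the string $\sigma$: for a universal machine $U$ and any other machine $V$,

$$K_U(\sigma) \;\leq\; K_V(\sigma) + c_{U,V} \qquad \text{for all strings } \sigma,$$

where the constant $c_{U,V}$ – roughly, the length of a program for $U$ that emulates $V$, the "compiler" of the preceding paragraph – does not depend on $\sigma$.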

Solomonoff Prediction If MDL is viewed as a more realistic approximation to idealized MDL, then it's only a small step to present MDL as a practice-oriented offspring of Solomonoff's theory. This is because idealized MDL can be described as a model selection version of Solomonoff's theory of Prediction.

Whereas (idealized) MDL is about model selection (given a collection of data, identify a hypothesis), SP is about prediction: given a collection of data, identify the next data item. Specifically, given a sequence of symbols, predict the next symbol. Again, the key idea is to set out from a universal computer language that serves as a description method for data objects (in this case, sequences of symbols). The problem of predicting symbols comes down to the proper assignment of probabilities to symbol sequences, and these probabilities are derived from the programs written in our computer language. Namely, the shorter the programs that generate a symbol sequence, the greater its probability. That is, the more compressible a symbol sequence, the more probable it is. This reflects an important role for the simplicity of prospective sequences.

The formalization of this idea leads to a universal prediction method M, which gives a probability distribution on symbol sequences. By conditionalizing on the data we have already seen, we can use M to derive the probability of each possible next symbol. Importantly, we can rigorously prove that this method of prediction quickly gives optimal predictions, irrespective of the actual data-generating distribution – as long as it's computable.
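The conditionalization step can be illustrated with a minimal sketch (not from the thesis). The universal distribution M itself is incomputable, so the sketch substitutes a hypothetical computable stand-in – a uniform mixture over a small grid of Bernoulli sources – and only the last step, predicting the next bit by conditionalizing on the data seen so far, is the point of the example.

```python
from math import prod

# Hypothetical stand-in for the universal distribution M: a uniform mixture
# over a small grid of Bernoulli sources. Solomonoff's M mixes over all
# programs of a universal machine and is incomputable; the prediction step
# below (conditionalizing on the observed past) is the same in both cases.
GRID = [i / 10 for i in range(1, 10)]   # candidate source biases 0.1 ... 0.9

def mixture(string):
    """Probability that a source drawn uniformly from GRID emits `string`."""
    def source(p):
        return prod(p if bit == '1' else 1 - p for bit in string)
    return sum(source(p) for p in GRID) / len(GRID)

def prob_next_one(past):
    """P(next bit = 1 | past) = M(past + '1') / M(past)."""
    return mixture(past + '1') / mixture(past)

print(prob_next_one('0000000000'))   # well below 1/2: the past points to a 0-heavy source
print(prob_next_one('1111111111'))   # well above 1/2
```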

A Different View of Solomonoff Prediction

The previous overview of idealized MDL and SP reflects the dominant way of presenting the subject. Although it provides a coherent and intuitively appealing picture, it's necessarily a bit simplistic – in some instances a bit too simplistic.

Relation between the theories To regard practical MDL as just an approximation of ideal MDL, for instance, is to disregard some essential differences (e.g., [66, s. 17.8]). Moreover, there are a number of nontrivial subtleties involved in translating Solomonoff's predictive theory to an idealized MDL principle of model selection (e.g., [119, s. 5.4]). So, when Rissanen notes in the introduction to his 1989 monograph that

“One main source of inspiration in developing the MDL principle has been the theory of algorithmic complexity, due to the founding pioneers, Solomonoff, Kolmogorov, and Chaitin”,

he is quick to add:

"As will be apparent, the role of algorithmic complexity theory is inspirational, only, for almost everything about it [...] must be altered to make the ideas practicable." [144, p. 10]

There is yet much to be said about the exact relation between Solomonoff Prediction and the practical and idealized versions of MDL. The discussion in this thesis, however, is restricted to Solomonoff's theory. It also involves a critical evaluation of the above story, resulting in an alternative view that forms a significant departure from it.


A different view In one sentence, the above standard interpretation says that the content of Solomonoff's theory is the specification – with the help of notions from probability theory, information theory and computability theory – of a perfectly optimal all-purpose method of prediction, which via the essential use of data compression implements (and so justifies) a fundamental preference for simplicity of hypotheses.

In my view, the content of Solomonoff's theory is a statement that pertains to the structure of the philosophical problem of prediction: namely, that notions from probability theory, information theory and computability theory allow us to describe a model of prediction that is universally general (in the sense that it fits any predictive problem), in which we can, moreover, prove the existence of a class of universally reliable prediction methods (with a role for simplicity that is tentative at best). I will defend that this is a less assuming but more sound and useful way to look at SP.

1 Solomonoff's Theory of Prediction

This first Chapter provides an overview of the formal aspects and technical details of Solomonoff's theory of Prediction (SP). In introducing the various concepts, I roughly take a route from the original presentation of Solomonoff to the modern presentation of Li and Vitányi (and Hutter), while at the same time putting things in a more general perspective that will blend into my own interpretation in the later Chapters. After putting down the problem setting (Section 1.1), I will present the four different definitions of a universal predictor that Solomonoff introduced in his early 1960s papers. The investigation of these definitions and their properties directs the remainder of the Chapter. The discussion of the first two models mainly serves to introduce in an easy-going fashion some basic concepts of the theory, and is a prelude to the discussion of the central third definition of algorithmic probability (Section 1.2). The link to the modern presentation is fully established in the discussion of the fourth definition of the universal mixture distribution, which also covers the formal equivalence with the previous definition of algorithmic probability (Section 1.3). The Chapter closes with formal convergence and optimality results that are to demonstrate the power of the universal predictor and the value of the theory (Section 1.4).

1.1 The Setting

The framework of Solomonoff's theory arises from the fusion of three important fields: information theory (Subsection 1.1.1), probability theory (Subsection 1.1.2) and computability theory (Subsection 1.1.3).

1.1.1 Inspiration from Information

I will start by describing the general problem setting of Solomonoff's theory, and proceed by indicating the affinity with information theory.

1.1.1.1 The Problem

The problem that Solomonoff set himself is the following. We are given a finite sequence of symbols from some preordained finite alphabet, and we ask: how would it continue? If we suppose that the given data string is only an initial segment (i.e., a finite beginning) of a potentially infinite sequence, how can we successfully uncover (finite parts of) the remainder of the sequence? Based on the structure and patterns in the given string, we have to find a way of extrapolating it, of predicting the continuation.


One could figuratively brand the developing binary sequence an ensuing history – the complete history being the potentially infinite binary sequence. At any time, we are given the past, a finite initial segment of the complete history. Then our objective is to predict the future up to some point, a larger finite initial segment of the complete history. This setting is a prime example of a predictive problem. The tacit assumption is that any other predictive problem can also be put into this form. In that case the problem of prediction is equivalent to the problem of predicting the continuation of binary strings, and results in this setting would extend to prediction in general.

Prediction v. induction Solomonoff's theory is often referred to as Solomonoff induction in the literature (e.g., [168, 119]). The act of induction or inductive inference, however, has a broader connotation than solely the anticipation of the future (that is, prediction). Importantly, induction is commonly associated with the notion of explanation from a more general hypothesis (see, for instance, [45, s. 4.1]). A relevant, more precise, distinction is that between prediction and model selection. Solomonoff's theory is not a theory of model selection; it's a theory of extrapolation of sequences (what is called a prequential, predictive-sequential, approach in [38]), a theory of prediction. For that reason, I won't use the label 'Solomonoff Induction', but rather Solomonoff Prediction (SP).

1.1.1.2 Information

The previous problem setting can be conceptually simplified by assuming the two-element alphabet {0, 1} of binary digits or bits. Thus Solomonoff's theory is about prediction of binary sequences or bit strings. This is a harmless simplification because any other finite alphabet can be reduced to the binary alphabet by encoding the original symbols as bit strings.

Notation For future reference, I will briefly go through the bit string notation that I will employ throughout the thesis. Finite binary strings (formally, elements of the set 2^{<ω}) I denote alternately by Greek letters σ, τ, ρ. Infinite binary strings (elements of 2^ω) are denoted by capital Latin letters S, T. The concatenation of two finite strings σ and τ is simply written στ. The length of string σ is marked |σ|. The n-th bit of σ is denoted σ(n). If σ is a prefix of τ (that is, σρ = τ for some ρ), I write this as σ ≼ τ (if additionally |σ| < |τ| this becomes σ ≺ τ). The prefix of σ of length n is denoted σ↾n. By σ∗ is meant any finite string that has σ as prefix. If σ and τ are incomparable, that is, neither is a prefix of the other, this is denoted σ | τ.
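For readers who find code easier to parse than notation, the following minimal sketch (not part of the thesis) mirrors the prefix relations on finite strings, represented here as ordinary Python strings over '0' and '1'; note that Python indexing is 0-based, whereas σ(n) above is best read as 1-based.

```python
def is_prefix(sigma, tau):          # sigma ≼ tau
    return tau.startswith(sigma)

def is_proper_prefix(sigma, tau):   # sigma ≺ tau
    return tau.startswith(sigma) and len(sigma) < len(tau)

def incomparable(sigma, tau):       # sigma | tau
    return not is_prefix(sigma, tau) and not is_prefix(tau, sigma)

sigma, tau = '0110', '011010'
print(len(sigma))                   # |sigma|
print(sigma + tau)                  # the concatenation sigma tau
print(tau[:4])                      # the prefix of tau of length 4
print(is_proper_prefix(sigma, tau), incomparable('01', '10'))   # True True
```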

The formulation of the theory in terms of sequences of bits accentuates an intuitive link with the concept of information.

Information theory The basis of information theory (see [33]) lies in C.E. Shannon’s classical paper A Mathematical Theory of Communication (1948, [155]). Among other things, Shannon introduced in this pioneering work the bit as fundamental unit of information, the notion of

a random variable's (Shannon) entropy as a measure of information, and the operational characterization of entropy as the minimum expected code length that an encoding of the random variable can achieve. The latter lower bound, established in the Source Coding Theorem (or Noiseless Coding Theorem), prompted the design of encodings that are close to this optimal expected description length. Prime examples are the (suboptimal) Shannon-Fano coding (also proposed in the original paper), and the truly optimal Huffman coding [73].
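As a small illustration of this operational reading of entropy (a sketch, not from the thesis; the example distribution is hypothetical), the following compares the entropy of a distribution with the expected code length of a Huffman code for it; for the dyadic distribution used here the two coincide exactly.

```python
import heapq
from math import log2

def entropy(p):
    """Shannon entropy of a distribution given as {symbol: probability}."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def huffman_lengths(p):
    """Code length per symbol of a Huffman code for distribution p."""
    heap = [(prob, i, [sym]) for i, (sym, prob) in enumerate(p.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in p}
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1          # each merge adds one bit to these symbols' codewords
        heapq.heappush(heap, (p1 + p2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}   # hypothetical source distribution
lengths = huffman_lengths(p)
expected = sum(p[s] * lengths[s] for s in p)
print(entropy(p), expected)          # both equal 1.75 bits for this dyadic distribution
```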

The association of SP with the notion of information is strengthened by the reference to some central elements of information theory. In the paper The Discovery of Algorithmic Probability ([169], 1997), that can be seen as his intellectual autobiography, Solomonoff notes that

“Shannon’s papers and subsequent developments in information theory have profoundly influenced my ideas about induction. Most important was the idea that information was something that could be quantified and that the quantity of information was closely related to its probability.” [p. 75]

With respect to the quantification of information, recall the heuristic role of data compression in the story of Chapter 0. The actual debt of SP to information-theoretic ideas about encoding will become more clear in Section 1.2. There I will also point out how information theory provides a bridge from binary strings to probabilities.

1.1.2 Plugging in Probabilities

The essence of what we may call the problem of prediction is that, in principle, anything can happen. After patiently witnessing the generation of billions of 0's in a row, we're still not safe to say that the next bit will be a 0 too: who or whatever is responsible might play a trick on us and surprise us with a 1. Nevertheless, it seems only commonsensical to say that it's likely that in this situation the next bit will be a 0. Taking this line, Solomonoff's problem setting can be presented more precisely in probabilistic terms: given a finite data sequence, what is the probability of each possible continuation? The problem, then, is to come to a valid probability distribution on bits or (finite) binary strings of them, strings that are continuations of a given data string. We aim for an assignment of probabilities that takes into account the data we already have, and that allows for proper readjustment, in the hope of increasing accuracy, when we learn more. As such, we are led to a probabilistic theory of confirmation. One could say (cf. [135, 60]) that the probabilistic approach to confirmation in the philosophy of science only became prominent with the work of R. Carnap – work that was also an important inspiration to Solomonoff.

1.1.2.1 Carnap and Logical Probabilities

Carnap's project is a high point in the logical approach to probability. This approach, already initiated by A. De Morgan and G. Boole, and defended in detail by J.M. Keynes

in his Treatise on Probability (1921, [100]), builds on an extension of deductive logic to what is sometimes called inductive logic, a logic that includes less-than-certain inferences. The relation of logical implication is generalized to the relation of confirmation. As with deductive logic, this requires the framework of a formal language.

Carnap's formal language In [8], and later in his monumental Logical Foundations of Probability (1950, [10]), Carnap considered very simple logical languages consisting of finitely many monadic predicates (naming properties) and countably many constants and variables (naming individuals). For such a language, he showed how a probability measure m over state descriptions (maximally consistent sentences that include all combinations of properties and individuals – intuitively: expressing complete information) extends to a probability measure over all sentences, and, finally, how this induces a confirmation function

$$c(h, e) := \frac{m(h \wedge e)}{m(e)} \qquad (1.1)$$

that expresses the degree of confirmation of one sentence h (representing some hypothesis) relative to another sentence e (representing some evidence). This c fully depends on the (unconstrained) choice of an initial m. In pursuance of a unique preferred m, Carnap further defined structure descriptions as those sets of state descriptions that are closed under permutation of names of individuals (intuitively, a different label doesn’t reflect any relevant qualitative difference, so these state descriptions can be considered equal). The preferred measure m∗ then assigns equal measure to each structure description. Carnap’s reason for preferring this particular measure, and the induced confirmation function c∗, is its agreeable property of giving greater measure to “homogeneous” state descriptions that contain few individuals (because within each structure description, the size of which depends on the number of individuals in its elements, the weight is distributed equally). However, Carnap soon felt the need to generalize his approach. In The Continuum of Inductive Methods (1952, [11], also see [13]), the confirmation functions c are supplied with a positive real-valued λ (resulting, for each c, in a continuum of functions cλ) to adjust the extent of the evidence function’s updating when given new evidence. Carnap moreover postulates various axioms that enforce aspects like symmetry under permutation of individuals and long-run convergence to relative frequencies.
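A worked toy case may help to see how m* and c* operate. The following minimal sketch (not from the thesis) uses a hypothetical language with one monadic predicate P and two individuals a and b; it reproduces the familiar value c*(Pb, Pa) = 2/3, showing that under m* the evidence Pa raises the probability of Pb above its prior value of 1/2.

```python
from itertools import product
from fractions import Fraction

# Toy language: one monadic predicate P and two individuals a, b (hypothetical names).
# A state description fixes, for every individual, whether it has property P.
individuals = ['a', 'b']
states = list(product([True, False], repeat=len(individuals)))

# A structure description is a class of state descriptions that agree up to a
# permutation of individuals; with one predicate it is fixed by how many individuals have P.
def structure(state):
    return sum(state)

n_structures = len(set(structure(s) for s in states))

# m*: equal measure for each structure description, divided equally among its members.
m_star = {}
for s in states:
    members = [t for t in states if structure(t) == structure(s)]
    m_star[s] = Fraction(1, n_structures) / len(members)

def m(event):
    """Measure of the set of state descriptions satisfying the sentence `event`."""
    return sum(m_star[s] for s in states if event(s))

def c_star(h, e):
    """Degree of confirmation c*(h, e) = m*(h and e) / m*(e), as in (1.1)."""
    return m(lambda s: h(s) and e(s)) / m(e)

Pa = lambda s: s[0]      # 'individual a has property P'
Pb = lambda s: s[1]      # 'individual b has property P'
print(c_star(Pb, Pa))    # 2/3: Pa confirms Pb relative to the prior value 1/2
```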

Interpretation of probabilities Logical theories of probability are often taken to also provide a meaning to the concept of probability: the logical interpretation of probability (cf. [57, 68]). This places the logical approach in the middle of the peaking skirmish between multiple competing interpretations of probability at the beginning of the previous century.

Various interpretations In the early days of mathematical probability (associated with the names of Bernoulli, Pascal and, principally, Laplace), the concept of probability was customarily interpreted in an epistemic sense, as absence of evidence. This classical interpretation, inspired by games of chance, mainly applied to calculations that relied on a starting assumption that probability could be equally distributed over all possible outcomes (see the discussion of the principle of insufficient reason on page 23). A main objection to the classical interpretation is its limited applicability. Only in very few circumstances can we confidently determine all possible cases, let alone consider them equally

likely. In practice, we often count the relative frequency of events to assess their probability. This link between probability and relative frequency was already explored by Bernoulli and De Moivre in the 18th century, and forcefully defended by J. Venn in the next. A systematic exposition of the frequency interpretation, identifying the probability of an event with its limiting relative frequency of occurrence, was first given by R. von Mises in the early 20th century; another prominent supporter at the time was H. Reichenbach. The frequency interpretation is an objective interpretation, in the sense that it treats probability as a physical property that can be determined in experiment. Also in the beginning of the previous century, F.P. Ramsey [139] advanced a radical return to an epistemic understanding: the subjective (or personalist) interpretation. In this view, probabilities are identified with a person's degrees of belief. Although he presented his position in strict opposition to Keynes' objective-logical interpretation, Ramsey also sought to give a logical account of probability: a logic of beliefs.

Carnap tried to disentangle matters by suggesting that the imprecise everyday explicandum "probability" actually denotes two very different concepts, and so gives rise to two different precise explicata [9], that we simply have to keep separate.1 The concept probability1 refers to a (logical) measure of confirmation, while the concept probability2 refers to a measure of frequency. Naturally, the first concept is the one that is explicated in his own system. But even after this distinction, Carnap thought, we are still left with the question of the proper meaning of logical probability1. He became dissatisfied with the ambiguity of the interpretation as simply an objective degree of confirmation between a hypothesis and a piece of evidence. Over time, he also abandoned the possible interpretation as an estimation of relative frequency, in favour of an interpretation in terms of fair betting quotients [12, p. xv].

Betting The interpretation of probabilities as betting odds was in fact developed as an operational definition of Ramsey's subjective probability. As a generalization of deductive logic's requirement of consistency, inductive logic is associated with the requirement of compliance to the axioms of probability.2 In an inductive logic of beliefs, this can be seen as a minimal requirement of an agent's rationality. B. de Finetti [40] first showed that if your degrees of belief as betting odds are coherent (that is, it's impossible to construct a Dutch book out of them, a bet that is guaranteed to lose you money) then they will satisfy the probability axioms. J.G. Kemeny [99] later showed the converse, establishing that coherence is both necessary and sufficient for your degrees of belief satisfying the axioms of probability. This is strong support for the identification of rationality with coherence, and for the operationalization as betting quotients.

Carnap's embrace of the subjectivist's betting odds interpretation makes him part [189, p. 302] of the wider Bayesian movement in the philosophy of science, to which we turn now.

1Although this distinction was certainly not new (having been noted already by Poisson, and also made explicit by Ramsey), it only became generally acknowledged with Carnap's exposition [189].
2Normally, the axioms of Kolmogorov [104].


1.1.2.2 Bayesian Confirmation Theory

The theory of Bayesian confirmation (see [47], [34, ch. 5]) in the philosophy of science has two main characteristics. The first is the procedure of conditionalization by the use of Bayes' rule. The second is the adherence to the subjective interpretation of probability.

Bayesian reasoning Consider the following very general form of the problem of scientific inference. We have under investigation a hypothesis H, and we are presented some experimental data D. Now we wonder if (and how much) our confidence in H should increase in light of D. This would depend on

◦ how much faith we had in H in the first place (the prior probability P(H));

◦ the extent to which H predicted outcome D (denoted P(D | H), the probability of D conditional on H, and curiously called the likelihood of H);

◦ how surprising D was (its expectedness P(D)).

Since we expressed these things in terms of probabilities, we can derive our new confidence in H, the posterior probability P(H | D), using the probability-theoretic equation known as Bayes' rule:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}. \qquad (1.2)$$

Having thus derived the posterior probability, we may forget about the old prior probability P_old(H) and assign P_new(H) = P_old(H | D). According to the relevance criterion of evidence, linking probability to confirmation, the meaning of a strictly greater P_new(H) > P_old(H) is precisely that H is confirmed by D. Repeating this method of conditionalization, as new data is brought in, is the final stage of the procedure of Bayesian reasoning.
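A minimal numeric sketch of this updating cycle (not from the thesis; the two hypotheses about a coin's bias and the observed outcomes are hypothetical) applies Bayes' rule (1.2) once per observation and then takes the posterior as the new prior:

```python
# Two hypothetical hypotheses about a coin, with '1' standing for heads.
priors = {'fair (p=0.5)': 0.5, 'biased (p=0.8)': 0.5}
bias = {'fair (p=0.5)': 0.5, 'biased (p=0.8)': 0.8}

def update(current, outcome):
    """One round of conditionalization: replace P(H) by P(H | D)."""
    likelihood = {h: (bias[h] if outcome == '1' else 1 - bias[h]) for h in current}
    expectedness = sum(likelihood[h] * current[h] for h in current)   # P(D)
    return {h: likelihood[h] * current[h] / expectedness for h in current}

posterior = priors
for outcome in '1101':              # heads, heads, tails, heads
    posterior = update(posterior, outcome)
print(posterior)                    # the biased hypothesis has gained probability
```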

Bayesian personalism Bayesianism or Bayesian confirmation reflects a much wider position than the mere acceptance of Bayes' rule, which is simply a consequence of the Kolmogorov axioms and so absolutely uncontroversial. Somewhat confusingly, the term has come to be positively identified with the subjective interpretation of probability. The inclusion of the above method of Bayesian reasoning then comes down to saying that you have to be consistent in updating your beliefs in a hypothesis: your belief in a hypothesis after seeing data should not differ from the conditional belief you had in the hypothesis before seeing the data. Of course, this still leaves you to form your prior (conditional) beliefs: the values to be inserted in the right-hand side of Bayes' rule. Unsurprisingly, this is where the difficulties arise.


Problems of personalism The sole requirement of rationality as identified with coherence still leaves the scientist much freedom in constructing a set of degrees of belief. Subjective as they are, they may very well depend on all kinds of objectively irrelevant factors:

“A given hypothesis might get an extremely low prior probability because the scientist considering it has a hangover, has had a recent fight with his or her lover, is in passionate disagreement with the politics of the scientist who first advanced the hypothesis, harbors deep prejudices against the ethnic group to which the originator of the hypothesis belongs, etc.” [148, p. 183]

– a situation that clearly conflicts with the Bayesian apparatus as an objective model of scientific inference. There are, of course, plausibility criteria that can constrain the assignment of a value to the prior probability of a hypothesis further (such as the scientific standing of the person first advancing it, compatibility with accepted theory or theoretical principles like its simplicity), leading to a more tempered personalism [156]. W.C. Salmon [147, ch. 7] goes as far as claiming that closing in on the estimated success of a hypothesis by these criteria brings us close to an objective frequentist interpretation again. But procedures for actually deducing values are lacking. There have been, for example, multiple attempts to evaluate the simplicity of a theory in Bayesian terms (e.g., based on the number of independent parameters in the equations of the theory [91], a "clustering assumption" that a homogeneous world is more likely than a heterogeneous one [70], or the identification of simplicity with the coverage of evidence [146]), but none of them is unproblematic [60]. Even Salmon admits that tempered personalism must involve a good bit of handwaving, and that exact numbers for degrees of belief should be taken with a grain of salt.

One reply is that this is a feature rather than a bug. For it is only natural that different scientists start with different and imprecise degrees of belief – the important characteristic of science is that, as evidence accumulates, the different opinions converge to a more and more objective consensus. And indeed, this washing out (or swamping) of priors in the Bayesian framework is supported by a convergence theorem [149] that shows that in the long run degrees of belief of different agents will almost surely merge and converge to the very same hypothesis. This result would seem to imply that the doubtful assignment of subjective priors plays no substantial part in the end, essentially making the problem of priors disappear and providing very strong support for the objectivity of the Bayesian approach.

As could be expected, things are not as clear-cut as that. First notice the "almost surely" qualification, signifying that the probabilities won't simply merge and converge but have a second-order probability going to 1 of doing so: this is reason for C. Glymour [60] to take the theorem as having relevance only for people who are convinced already of the rationality of Bayesianism. Apart from that, there are some strong and arguably unrealistic conditions assumed by the theorem, such as that the data are taken from independently and identically distributed trials and that all agents have the same likelihoods that are different for different hypotheses [70]. There are more powerful convergence results that don't need the latter conditions, such as the theorem by Gaifman and Snir [56] in a model-theoretic setting based on Doob's theory of martingales. But not only are we now confronted with a lack of a useful estimate of the rate of convergence [47, ch. 6], a comparison with formal learning theory (see page 65) shows that there must exist problems that can't be learned by a Bayesian agent, which implies that the convergence theorem imposes a "dogmatic" a priori knowledge in the form of a restriction on the class of possible models [47, ch. 9].


Thus the problem of an assignment of prior probabilities P(H), an assignment that meets the constraints of rationality and objectivity that one might expect from an account of scientific inference, is one of the great challenges for Bayesian confirmation theory.3 (And then I still leave aside the issues surrounding the expectedness P(E) of data: notably, the problem of old evidence.4)

1.1.2.3 Back to Solomonoff

Transferring the procedure of Bayesian reasoning to Solomonoff's setting of prediction of continuations of binary strings, we may rewrite Bayes' rule (1.2) as

$$P(H_{\sigma\tau} \mid D_{\sigma}) = \frac{P(D_{\sigma} \mid H_{\sigma\tau})\,P(H_{\sigma\tau})}{P(D_{\sigma})},$$

where the "hypothesis" H_στ is just a possible extrapolation στ of the past string σ that we have seen so far ("data" D_σ). As the future στ implies past σ, we may drop the likelihood term P(D_σ | H_στ) that must equal 1. Then our formula, estimating the probability of future string τ knowing past σ, reduces to

$$P(\tau \mid \sigma) = \frac{P(\sigma\tau)}{P(\sigma)}. \qquad (1.3)$$

The task of assigning probability values appears to be considerably simplified in Solomonoff's setting. Not only can the likelihood P(σ | στ) = 1 be crossed out, but finding values for the prior probability P(στ) and the expectedness P(σ) also conveniently reduces to the same problem of agreeing on the prior P on finite bit strings. The hope is that the more restricted setting of prediction of bit strings is also susceptible to a solution to the last remaining problem, the problem of the prior probabilities.

Back to Carnap Conditionalization by means of the specific instantiation (1.1) of Bayes' rule lies at the heart of Carnap's system. While sympathetic to the subjectivist interpretation, the formal system and its axioms can be seen as imposing further restrictions on the choice of probability values, thus restraining, in what Carnap thought was a sensible way, the full freedom that the general Bayesian confirmation theory does leave us [189]. Indeed, the logical interpretation can be seen as a sort of idealized extrapolation of the subjective interpretation, where the additional restrictions enforce that the degrees of belief are ultimately determined by the purely logical structure and relations of the propositions in question [174].

3The problem of Bayes' original article [6] was to calculate the chance that the probability of an event lies in some interval, given a number of observations. Rather than in the rule that now bears his name (but that doesn't play a direct role in his method), the significance of Bayes' article lies in the solution of the problem by calculating a posterior probability from a prior probability of distributions. This introduced the fundamental importance of determining the prior distribution – and so this problem is an aspect that can truly be called Bayesian. [47, ch. 1]
4Values for the likelihoods are often more straightforward to determine, as they generally reflect an objective relation between a hypothesis and experimental data.


It’s hard to gainsay, however, that in Carnap’s project these restrictions have to be implemented by the use of a highly artificial language, bearing with it an unavoidable element of subjectivity. This is taken by many critics as showing that the endeavour is fundamentally flawed (e.g., [126]). The project of Solomonoff is in several ways inspired by that of Carnap. In narrowing things down to a prequential setting, Solomonoff follows Carnap’s belief that predictive inference is “the most important and fundamental inductive inference” [10, §44]. In his 1964 paper, he emphatically presents his theory as a formalization of Carnap’s probability1, thus continuing the tradition of logical probability. Solomonoff was in fact a student of Carnap in Chicago, and he acknowledges Carnap’s influence thus:

"I liked his function that went directly from data to probability distribution without explicitly considering various theories and 'explanations' of the data. (...) I also liked his idea of P2(H,D) [this should be P1 – TS] and the idea of representing the universe by a digital string, but his method of computing the a priori probability distribution seemed unreasonable to me. The distribution depended very much on just what language was used to describe the universe. Furthermore, as one made the describing language larger and more complete, predictions were less and less contingent on the data. Carnap admitted these difficulties, but he felt that his theory nonetheless had redeeming qualities and that we would eventually find a way out of these difficulties. Algorithmic Probability is close to Carnap's model, and it does overcome the difficulties described." [169, p. 76]

Solomonoff also sought formal restrictions on probability assignments, relying on a language of bit sequences that would be less vulnerable to charges of arbitrariness.

1.1.3 Connecting to Computability

Solomonoff aimed to arrive at objective prior probabilities by embedding his setting in the theory of computability. This move provides a formal grip on the problem, while still keeping things highly general. The central supposition is that any binary sequence is generated in some effectively computable way.

The theory of computability Computability theory5 originated in the quest for a mathematical characterization of the notion of being computable by an effective procedure, an algorithm. Intuitively, an effective (or mechanical) method is given by a finite number of instructions, that are sufficiently precise to require no ingenuity whatsoever to be followed without error, and that can be executed to produce a result in a finite number of steps.

5Also known as recursion theory or recursive function theory; standard references to the subject are [145, 128, 160].


Various notions This quest was triggered by work in the foundations of mathematics in the first half of the previous century, particularly D. Hilbert's Entscheidungsproblem: does there exist an effective procedure to decide for any given mathematical proposition whether it is provable? The first step towards an answer had to consist in making more precise the notion of effectively computable function, and several, seemingly very different, formalizations were advanced. A. Church proposed what was later named Church's Thesis [24, 102] to identify the effective functions with his λ-definable functions [23, 26], which was rejected by the great logician K. Gödel; Church then proposed to phrase his thesis in terms of Herbrand and Gödel's (general) recursive functions [69, 61] instead, which was still not to Gödel's liking.

Then A.M. Turing [172], independently and unaware of the other efforts, published a model that would convince both. Turing described an abstract computing device, now known as the Turing machine, that mimics an idealized human being, only aided by pen and paper, that tirelessly and stupidly carries out a precise set of instructions. (Note that at that time, a computer was understood to mean a human computer.) Aimed at the Entscheidungsproblem, it was to give the most general description of what can be humanly achieved in such manner, and it did so in a highly figurative way. In the words of Church, "(...) computability by a Turing machine (...) has the advantage of making the identification with effectiveness in the ordinary (not explicitly defined) sense evident immediately." [25, p. 43]

S.C. Kleene [101] had already shown that the λ-definable and the general recursive functions identified the same class of functions; now Turing sketched in an appendix of his paper a proof that the model of Turing machines, yielding the class of (Turing) computable functions, is in the same sense equivalent to the model of the λ-calculus. These results, that widely different formalizations led to the very same class of effective functions, gave strong support to what is now well-known as the Church-Turing Thesis [173, 103]: that the effectively calculable functions are precisely the formally computable ones. Says Gödel,

“(...) with this concept one has for the first time succeeded in giving an absolute notion of an interesting epistemological notion, i.e., one not depending on the formalism chosen.” [62, p. 84]

Machines and computers It’s safe to say that the Church-Turing Thesis (CTT), in the above form, has found general acceptance [157]. One compelling reason, as noted, is the confluence of many formalisms (those of Church, G¨odel and Turing only being the the most well-known) that all turned out to capture the precise same class of functions [59]. The lack of counterexamples, more than half a century later, is another. Lastly, the model of Turing can at once be seen as capable of imitating any operation that a human computer following an effective procedure could execute. Of special significance is Turing’s construction of a universal Turing machine that can perform any computation by taking as input the code for another Turing ma- chine and some second value x, and emulating the given machine on this x. While the machine model was originally intended to reproduce a human computer at work, nowadays the obvious analogy is the omnipresent digital computer – the sensational rise of which can be seen to have originated in Turing’s purely theoretical model. A Turing machine then really corresponds to a single computer program, and a univer-

20 1.1. The Setting sal Turing machine to the familiar general-purpose computer that can execute any conceivable program, given the right instructions. Turing machines (henceforth simply “machines”), and variations thereof, will un- derly many technical discussions in this thesis. (See page 26 for a more detailed discussion of the model.)

Terminology and notation When I talk about computability, I refer to the formal no- tion of Turing-computability (i.e., computable by a Turing machine). For the intuitive notion of computability (that may or may not be identified with Turing-computability, depending on whether one adheres to the CTT), I reserve the term calculability. I will often speak about effectiveness (e.g., the effectiveness of the setting of SP): this functions as a sort of umbrella term that signifies the role of calculability, which is often made formally precise by Turing-computability or a weaker variant (e.g, semi- computability, see page 32). A Turing machine I usually refer to by the letter M; in case the machine is universal, the letter U. If machine M on input σ halts at some point, I write M(σ) ↓; if it diverges, M(σ) ↑. In the first case, the output is written M(σ).

Incomputability With the mathematical formalism in place, it was possible to rigorously prove the existence of fundamentally undecidable problems. Church [24] and Turing [172] had directly employed their own respective formalisms to show that the procedure required by the Entscheidungsproblem is not effective; accepting the Church-Turing Thesis then settles the problem in the negative. The Turing machine model pointed the way to other incomputable problems, problems that no Turing machine could solve, and so, with the CTT, allow for no effective solution. Turing’s earliest example was the incomputability of the Halting problem [172]: there can be no algorithm that for every combination of a (code for a) machine and another input value can tell whether the machine on this input value ever halts. We will see that incomputability is also characteristic of Solomonoff’s theory.

The mysterious computer At the Dartmouth conference, before he came to the development of his theory, Solomonoff discussed the problem of the extrapolation of sequences with J. McCarthy, one of the founders of the field of artificial intelligence. The day after this discussion, McCarthy made the following observation.

“Suppose we were wandering about in an old house, and we suddenly opened a door to a room and in that room was a computer that was printing out your sequence. Eventually it came to the end of the sequence and was about to print the next symbol. Wouldn’t you bet that it would be correct?” [169, p. 76]

This provides a nice illustration of the heuristic idea that will be formalized in the upcoming Sections. An effective process that produces a bit string up to some point, if allowed to proceed, is likely to give a legitimate continuation of the string. Putting this intuition in more precise terms, the probability of a continuation should express the likelihood that it is generated by some machine that has already produced the given initial segment. Generalizing, the probability of a sequence should express the

21 1. Solomonoff’s Theory of Prediction likelihood that it is generated by some machine. Since a universal machine emulates any other machine, this reduces to the likelihood that the sequence is produced by some agreed-upon universal machine. This is the main idea that underlies the different definitions that Solomonoff proposes in Parts I and II of his 1964 paper A Formal Theory of Inductive Inference [163, 164].

1.2 Algorithmic Probability

This Section introduces three of Solomonoff’s early definitions of a prior distribution on bit strings that are all based on descriptions to a universal machine. Essentially, the prior probability of a string is the probability that the universal machine receives an instruction to produce it. The first definition (Subsection 1.2.1), that specifies a distribution of strings as the relative frequency of their descriptions for a universal machine, forms a direct but still quite imprecize implementation of this idea. The second definition (Subsection 1.2.2 ) is an attempt to obtain the same distribution via an imaginary experiment of presenting random input to the universal machine, but suffers from a number of important defects. The third definition (Subsection 1.2.3) overcomes the shortcomings of the second definition by the use of a subtly different Turing machine model: this is the important definition of algorithmic probability.

1.2.1 A First Definition The basic idea that concluded the previous Section is that the probability of a binary string should be based on the likelihood that it is produced by a universal machine. The further idea is to quantify this by the likelihood that the universal machine receives a description of the string.

1.2.1.1 Ratio of Descriptions Let’s suppose we agree on the particular universal machine to use: call it the reference machine. Now we name a string τ a (U)-description of another string σ if τ leads U to produce (an extension of) σ: so U(τ) ↓< σ. Since the probability of a string is to correspond to the probability of the reference machine processing a description of it, the next step is to give a convincing definition of the latter probability. One approach is to try to directly capture the relative number of U-descriptions of σ. What is the ratio of descriptions of σ to the total number of descriptions? Here we have to keep in mind that some inputs may lead the machine to get stuck in infinite loops: those inputs shouldn’t be counted as descriptions of anything. All other inputs are n called valid. Let Tn = {τ ∈ 2 | U(τ) ↓} contain all such valid inputs of length n, and let Tn,σ = {τ ∈ Tn | U(τ) < σ} be the subset of inputs that are actually descriptions of σ. The ratio in the limit gives the first definition of the prior probability of σ:

|Tσ,n| PI(σ) = lim (1.4) n→∞ |Tn|

22 1.2. Algorithmic Probability

Notice that this definition relies on the inconspicuous presupposition that all de- scriptions of the same length should be counted as equals.

1.2.1.2 Indifference About Descriptions Definition (1.4) already hints at what will be stipulated even more explicitly in the upcoming definitions of Subsections 1.2.2–1.2.3: that all descriptions of the same length have equal probability. Solomonoff notes the resemblance of this assumption to the classical principle of insufficient reason:

“The formulation of the induction system as a universal machine with input strings of fixed length has an interesting interpretation in terms of “the principle of insufficient reason”. If we consider the input sequence to be the “cause” of the observed output sequence, and we consider all input sequences of a given length to be equiprobable (since we have no a priori reason to prefer one rather than the other) then we obtain the present model of induction.” [163, p. 19]

Insufficient reason and indifference The cornerstone of the classical interpretation of proba- bility (page 14) is that in the absence of any evidence (or in case of symmetrically balanced evidence) probability should be equally distributed over all possible outcomes. This is codi- fied in Laplace’s principle of insufficient reason: if there is no reason to think differently than that all cases are equally possible, then their probabilities are equal. Controversial from the start, by the beginning of the previous century the principle was generally considered to have succumbed under the many objections raised [175]. I already mentioned its doubtful applicability in situations not showing clear symmetries like those in games of chance. Then there are the conceptual difficulties concerning the strange move of inferring knowledge from lack thereof (“Ex nihilo nihil. It cannot be that because we are ignorant of the matter we know something about it.” [48]) and the general concern that even if the reliance on “possibility” is not circular (“the only sensible meaning one can give to the phrase ‘equally possible’ is, in fact, ‘equally probable”’ [141, p. 339]) then at least the principle doesn’t really say much. Finally, the Bertrand paradox shows that in uncountable spaces the principle can be used in infinitely many incompatible ways. Keynes, while also very critical, attempted to save a version of the principle of insufficient reason that he called the principle of indifference [100]. According to [57], this name was to stress the role of an individual judgement of irrelevance, a deliberate neglect of pieces of information deemed unimportant to the situation at hand; this is an act of knowledge rather than ignorance.

In the orderly context of bit string descriptions the subtle differences in interpre- tation of the principles of indifference and insufficient reason are, of course, quite irrelevant: Solomonoff and later Li and Vit´anyi use both denotations interchangeably. Li and Vit´anyi proceed to locate a link with another principle, “based on the same intuition” [117, p. 345], namely the principle of multiple explanations of the ancient Greek philosopher Epicurus. I will come to pause upon the merit of these associations in the upcoming Chapters.

23 1. Solomonoff’s Theory of Prediction

1.2.2 A Second Definition The definition of the 1964 paper that I consider next is actually the result of a step- wise refinement of a much simpler model that Solomonoff described first in his earlier 1960 account A Preliminary Report on a General Theory of Inductive Inference [162]. I will trace the same route here.

1.2.2.1 Random Descriptions Imagine that you’ve taken possession of the earlier reference universal Turing machine U, and you decide to see what happens when you feed it completely random noise. As a way of achieving this, you might use a (truly fair) coin to decide whether to give the machine a 0 (say on throwing heads) or a 1 (tails) next. Solomonoff first defined a string τ to be a (U)-description of another string σ if τ makes U produce σ and no more: so U(τ) ↓= σ. Now the probability that in your experiment of coin flipping you arrive at a specific binary string τ to present to U only depends on its length n = |τ|, and is exactly 2−n. So the probability that you arrive at the minimal, shortest, description τmin of some specific σ is given by

0 −|τmin| PII(σ) := 2 . (1.5) This probability of arriving in a random way at the minimal U-description for σ is what Solomonoff, initially, took to be the prior probability of σ.

Kolmogorov complexity The notion of a string’s shortest description via a universal machine is the essence of the definition of the Kolmogorov complexity of an object. That makes Solomonoff one of the inventors of the concept, along with Kolmogorov and Chaitin, who both independently introduced it a few years later. In terms of priority, we may consider Solomonoff the founding father of the field of algorithmic information theory. The idea behind Kolmogorov complexity as a measure of simplicity is that the complexity of a string is a matter of how easily it can be described. Simple strings that have a clear structure or follow a specific pattern can be given a succinct description in terms of that regularity, while there is probably no better way of describing a completely random string, that follows no rule at all, than just putting down the complete thing. Another way of phrasing this intuition is that simple strings are highly compressible: they must show certain regularities that allow us to provide much shorter descriptions. How to establish in a general way that a given string has some particularly simple pattern? What kind of regularity to look for? In a kind of move that will prove to be thematic to this area, the theory of computability is brought in to fill the holes that the previous informal discussion still leaves open. To illustrate, take some machine M representing a particular decompresssion algorithm. If on input string τ this machine outputs string σ, we say that τ is an M-description of σ: ideally M on τ produces a much longer string σ by following procedures coded in τ or in M itself. A measure of the information content of σ, although relative to M, is the length of its shortest such description: this is its Kolmogorov complexity (relative to M)

CM (σ) = min{|τ| : M(τ) ↓= σ}.

24 1.2. Algorithmic Probability

This definition of complexity looks arbitrary, because the choice of M is. But every machine can be seen as following some decompression rule that exploits a specific kind of regularity. Most machines, of course, are perfectly inadequate when viewed as implementing compression algorithms, but if the string at hand has any computable regularity, some machine must deal with just this regularity and will do very well on our string. So our strategy is that we replace the question whether a given string σ follows any pattern, which is too general to make precise sense of, by the question if there exists an algorithm that can find regularities in σ: we look for calculable regularity. The final step is to take a universal machine U that by emulating any other machine can be seen as implementing every single algorithm and in particular any decompression procedure, and define the Kolmogorov complexity of our string σ as the length of the shortest universal description for it: C(σ) = min{|τ| : U(τ) ↓= σ}. Actually, not just any universal machine U is quite suitable. We not only require that U can emulate any other machine, we also suppose that it does this in a way that is not too clumsy: it should be able to emulate any particular machine with no more than constant overhead. A standard universal Turing machine that emulates M on input σ by taking a description hρM , σi for machine M and input σ will do fine (the overhead is represented by the one description for M); but, for instance, the lengths of the descriptions for a universal machine that is defined only for emulating M on σ after receiving double inputs hρM , σσi will come to dwarf the lengths of the descriptions for original M. The technical term is that (the function computed by) U should be additively optimal – when talking about universal machines in the rest of this thesis, I mean to refer to additively optimal universal machines. Now an easy consequence of the definition using an additively optimal reference machine is the Invariance Theorem that says that, up to a constant, the shortest description given by the reference machine is no longer than that of any other machine: for all M there is a cM such that for all strings σ we have C(σ) ≤ CM (σ) + cM . Naturally, this constant cM represents the aforementioned overhead. The Invariance Theorem supports the claim that C gives an objective notion of complexity of strings. The conspicuous drawback of the definition, however, is that the Kolmogorov complexity of an object is incomputable, essentially because the Halting problem is.

Refinements There are at least two problems with the overly simplistic definition (1.5). For a start, there is no obvious reason why only the minimal descriptions should count. When flipping a coin to determine input bits for U, one may not arrive at the shortest description of string σ, but still end up with U-output σ via another, longer, description. Related to this, it makes sense, as we are concerned with prediction of continuations, to again extend the definition of description to also include input strings that result in outputs that start with the desired string. Then τ is a description of σ if U(τ) converges to any extension ρ < σ of σ. Solomonoff introduces the sets Tσ,n for all the descriptions of extensions of length n of σ, and then sums their lengths as n goes to infinity:

00 X −|τi| PII(σ) := lim 2 (1.6) n→∞ τi∈Tσ,n The second main problem is that the “distribution” given by (1.5) is not well-defined: the sum of the probabilities doesn’t converge. Worse, the new expression (1.6) is

25 1. Solomonoff’s Theory of Prediction already divergent for a single σ. In order to repair this defect, Solomonoff included a compensating -term, which finally results in the rather unwieldy definition

|τ | X 1 −  i PII(σ) := lim lim . (1.7) →0 n→∞ 2 τi∈Tσ,n

1.2.2.2 The Kraft Inequality and Prefix Machines The fact that expression (1.5) doesn’t induce a proper distribution is a direct conse- quence of the Kraft inequality, a central result in information theory. It concerns sets of finite strings that are prefix-free. String σ is sometimes called a prefix of τ if it is a strict initial segment σ ≺ σρ = τ of τ; a set X of strings is prefix-free if X doesn’t 6 contain any prefixes of its elements: σ ∈ X implies that σ  n∈ / X for all n < |σ|. Then we have that

Theorem 1.1 (Kraft Inequality [108]). Set X ⊆ 2<ω of finite strings is prefix-free if and only if X 2−|σ| ≤ 1. σ∈X For a standard universal Turing machine, there is no reason to expect that the set of minimum descriptions of all strings is prefix-free, hence, by Kraft’s Inequality, the sum of the “probabilities” given by expression (1.5) will exceed 1.

Prefix machines A straightforward remedy would be to somehow enforce that the minimal descriptions constitute a prefix-free set. It appears that it’s not hard to transform any given Turing machine into a machine the complete domain of which (i.e., the set of inputs on which it halts, including all minimal descriptions) is prefix- free (see [119, s. 3.1]). In fact, a very natural adaptation of the standard Turing model results in the class of so-called prefix machines.

The Turing machine model I will briefly describe the standard Turing machine model, or at least a standard version (models may differ in details that are non-essential, in the sense that the resulting class of machines corresponds to the same class of functions). The “hardware” of a Turing machine consists of a two-way infinite one-dimensional tape, that is divided into cells that each contain one of the possible symbols 0, 1 and the blank B, and a head that moves along the tape and can scan and overwrite the symbols in the cells. The head is always in one of a number of states. It is governed by the “software” of the machine, a list of instructions that at each discrete step tells the head to either move one cell to the left or the right or to replace the symbol in the current cell with symbol 0, 1 or B, and then to assume a certain state. This instruction uniquely depends on the combination of the state the head is currently in and the symbol it reads from the current cell. The fact

6Given any set of strings A, I will denote by A0 = {σ ∈ A | ∀τ ∈ A τ 6≺ σ} the prefix-free subset obtained by taking out all extensions of other strings in the set. Such a set A0 is sometimes called the bottom set of A.

26 1.2. Algorithmic Probability that the instruction in each situation is unique makes the machine deterministic; however, not every situation has to be anticipated. If the head finds itself faced with a state-symbol combination that doesn’t match any instruction, the machine halts. The input to the machine (the program it’s given) is the number of cells filled with 1’s at the start; we may assume that at that point the head (in a state qo) is positioned at the leftmost 1-cell (a B-cell if there is none). The output of the computation is the number of 1’s that are on the tape when the machine has halted.

A prefix-free Turing machine departs from the standard architecture in a number of ways: ◦ there are three different tapes: a one-way infinite input tape, a one-way infinite output tape, and a two-way infinite work tape; ◦ associated with the different tapes are three different heads, a read-only read- ing head and a write-only writing head belonging to the first and second tape, respectively; and the reading head only allowed to move to the right; ◦ there is no blank symbol B. Depending on the symbols scanned by the reading head and the work tape head, and the state the machine is in, the instructions tell the reading head to either move to the right or remain stationary, the work tape and writing head to move or write a symbol, and finally the machine to assume a certain state. Since there are no blanks, the one-way infinite input tape only contains 0’s and 1’s at the start of the computation; the reading head will then be positioned at the first, leftmost, cell. The output halts on input σ when the reading head is at the last symbol of σ when the machine halts. A moment’s reflection reveals that machines of this kind, given any infinite binary string S on the input tape, will only scan and process one maximal initial segment or prefix σ ≺ S. This results in a prefix-free domain dom(M) = {σ | M(σ) ↓} for each such prefix machine M. An intuitive interpretation is that such machines process self- delimiting programs: it’s no longer possible to use blanks to signal the end of input, so the program has to somehow encode itself when the machine should stop reading. For that reason we may also call them self-delimiting machines. For the class of prefix machines, too, it is possible to define a universal machine, a universal prefix machine. If such a machine is selected as the reference machine, Kraft’s inequality ensures that the probabilities given by (1.5), and indeed those given by (1.7), are bounded by 1.

Prefix Kolmogorov complexity What I described in the section about Kolmogorov complexity on page 24, based on regular machines, is commonly called plain Kolmogorov complexity. This definition still gives rise to some counterintuitive properties, notably the lack of subadditivity (we would like C(σ, τ) ≤ C(σ) + C(τ) to hold true), and complexity dips in infinite sequences (which imply the impossibility of sequences that only have incompressible initial segments – preventing a natural definition of random sequences). These issues can be seen as having to do with the fact that it can’t separate the information content of the bits of a bit string from the additional (but accidental) information in the number of bits. A regular machine can first

27 1. Solomonoff’s Theory of Prediction determine the length of the string before actually looking at the contents of the bits, and as such, it can “cheat” and get more information from the string than we intended in defining Kolmogorov complexity (also see [177, s. 5.2.1]). This is remedied by the use of prefix machines, which, after all, take self-delimiting pro- grams that have to include the information of their length in the bits themselves. This was the motivation for Chaitin [19] to introduce the model (also independently developed by Levin [115]). With the help of a universal prefix machine U, the prefix Kolmogorov complexity of a string σ is defined as K(σ) = min{|τ| : U(τ) ↓= σ}.

1.2.3 The Third Definition: Algorithmic Probability Solomonoff conjectured [163, p. 18] that the previous two definitions would in fact yield equivalent distributions. However, with hindsight he admits that his definition of the reference machine U is not sufficiently precise to settle this – indeed, he recognizes the room to define operations in such a way that he would be “unable to tell whether the expressions will converge or not” [169, A.2]. For the third model, he defined a very different type of universal machine (that can be used in definition (1.4) as well), a monotone machine that can process infinite input sequences.

1.2.3.1 Monotone Machines The monotone machine can be seen as a dynamic variant of the standard Turing machine. Apart from some work tapes, it has a read-only input and a write-only output tape, both of which can only be processed in one direction. That means that there is no looking back or overwriting: the reading head scans bit after bit and the output tape writes one bit after the other, performing computations on the work tapes inbetween. In the course of reading a (conceivably infinite) input string, the machine spits out output bits that it cannot retract, possibly leading to an infinite output string if the machine never halts. If monotone machine M’s output after processing input τ is σ, and no more (because it halts or computes forever without producing any more output), then this is written M(τ) = σ. Monotone machines derive their name from the characteristic property that M(τ) 4 M(τρ) for all sequences τ, ρ. Again one can construct a universal monotone machine that simulates every single monotone machine; let’s again assume some agreed-upon reference universal machine U.

1.2.3.2 Algorithmic Probability Suppose that you perform the same experiment of generating bits at random and feeding them to a universal machine, but now to a monotone machine U. Again, at any point in the experiment the reference machine will have received a random description τ of certain length, and as a result has produced a string of output bits ρ. And, again, what we are interested in, for a specific σ, is the probability that we obtain in this way an output string σ∗ that starts with our σ. This is just the probability

28 1.2. Algorithmic Probability that we arrive at a description for it, a τ such that U(τ) = σ∗. In this case, since any extension τ∗ of a description τ of σ is still a σ-description, we only care about the minimal σ-descriptions that are no longer descriptions if we chop one bit off. In short, the probability of obtaining a sequence that starts with σ in this experiment is the sum of the probabilities 2−|τ| for every minimal U-description τ of σ:

Definition 1.2 (Algorithmic Probability [163]).

X −|τ| QU (σ) := 2 ,

τ∈Tσ

0 with Tσ = {τ | U(τ) < σ} the set of minimal descriptions of σ.

Presupposing the prior selection of a reference universal monotone machine, we can drop the subscript U. This defines what would come to be called the algorithmic probability Q(σ) of given σ: the probability that one obtains it from the reference universal machine on completely random noise.

Description lengths and encodings In the bit-feeding experiment, shorter descrip- tions naturally have a higher probability, and so strings with shorter descriptions end up with a higher algorithmic probability. This feature is customarily interpreted as a bias towards simplicity (Chapter2, Subsection 2.2.3). Recall from Subsubsection 1.2.1.2 that the circumstance that descriptions of equal length are all assigned equal probability might also be seen to be covered by the principle of indifference. The partition in descriptions of equal length, however, would then be a proper addendum to the original principle. The idea that shorter descriptions should receive higher probability does obtain support from another angle: optimal coding in information theory7. I already quoted Solomonoff in Subsubsection 1.1.1.2 expressing his debt to the idea that quantity of information is closely related to probability. He specifically mentions the method of Huffman coding (page 12) for lossless data compression. Given a number of symbols with known probability (like the letters of the alphabet, with their relative frequency in a corpus of English texts), this method of coding assigns prefix-free bit strings to symbols in such a way as to attain maximal efficiency or minimal expected codelength. This is achieved, basically, by assigning shorter codes to more frequent or probable symbols. Algorithmic probability turns this idea the other way around: shorter codes are turned into higher probabilities. This intuitive relationship with optimal coding was welcomed by Solomonoff as “strong evidence that I was on the right track” [169, p. 75-6].

7It seems proper to note that information theory has also inspired an extension of the original principle of indifference: the principle of maximum entropy by E.T. Jaynes [88, 87, 89]. The principle tells us to select the probability distribution for which, within given constraints, the Shannon entropy is highest. Not wanting to digress, I will only say here that there are some strong formal ties with MDL, and refer for further details to [143, 49, 65, 66].

29 1. Solomonoff’s Theory of Prediction

1.2.3.3 Invariance There remains an obvious worry about the definition of algorithmic probability, in- deed about all the foregoing definitions. Just like there exist many different general programming languages, and many, many more possible ones to design, there are an infinite number of ways of defining your favourite universal machine. And here I just stipulated one particular reference machine, with nothing of an argument; a machine that, nevertheless, the definition critically relies on: doesn’t this fact make the defini- tion intrinsically subjective? Recall the Invariance Theorem that I mentioned in the discussion about Kolmogorov complexity on page 24. The result that for any M there is an independent constant cM such that for all strings σ it holds that K(σ) ≤ KM (σ) + cM also implies that for definitions K1 and K2 based on universal machines U1 and U2 there are c1 and c2 such that both K1(σ) ≤ K2(σ)+c1 and K2(σ) ≤ K1(σ)+c2 for all possible σ. (I will simply + write K1 = K2.) This stems from the fact that a universal machine can emulate any other machine, including other universal machines. The constant represents the additional code length that is needed to describe the machine to emulate. While the additive constants could be big, they are constant, which means that as sequences get increasingly long, and their minimal descriptions increasingly large, the additive constants get increasingly insignificant. Likewise, there holds a straightforward Invariance Theorem for algorithmic proba- bility.

Theorem 1.3 (Invariance Theorem [163]). For two alternative definitions Q1 and Q2 of the algorithmic probability (1.2) based on any two universal machines U1 and U2, it holds that × Q1 = Q2.

That is, for these definitions Q1 and Q2 there are independent constants c1 and c2 such that for any string σ it holds that Q1(σ) ≤ c1Q2(σ), and also Q2(σ) ≤ c2Q1(σ). Again we have that as sequences get increasingly long, and their probabilities in- creasingly small, the multiplicative constants get increasingly insignificant. Thus the invariance theorem states that asymptotically, the choice of universal machine doesn’t make a difference. It’s inessential to the theory, in a way that has been likened to the choice of a particular coordinate system in geometry (e.g., [4, p. 3489]). The theorem gives formal support to the objectivity of algorithmic probability.

1.3 The Universal Prior Distribution

The last definition that Solomonoff considered in his 1964 article is quite different from the previous ones. In a sense, the definitions we’ve looked at in the foregoing Section all took a bottom-up approach: they started off from descriptions given to the reference machine, and formalized how the processing of these descriptions gives rise to a distribution on binary sequences. The definition that is the topic of this

30 1.3. The Universal Prior Distribution

Section takes more of a top-down approach, by taking into consideration all possible probabilistic models of induction from the start.

1.3.1 A Universal Mixture Distribution One can define a probability distribution by bluntly taking a weighted mean of all possible probability distributions: a mixture distribution. This substitutes the original problem for the stipulation of a certain weighting of these distributions, a Bayesian assignment of prior probabilities to the distributions themselves. However, this still prompts one to decide on the class of possible probability distri- butions. Again, the key to a successful formalization is to only consider distributions that are effective in some specific sense.

1.3.1.1 The Fourth Definition: a Mixture Distribution Solomonoff bases the prior weighing of distributions on the relative number of descrip- tions of these distributions. As a description of a distribution P he takes any τ such that U(τσ) for all given σ produces the binary expansion of the probability P (σ). Let fi,n be defined as the fraction of descriptions of the i-th distribution Pi of length n. Then the definition is given by

X PIV := lim fi,nPi(σ). (1.8) n→∞ i This definition is not without problems, the most obvious being that the class of all probability distributions is way too large to be enumerable, let alone in an effective way. This makes the definition strictly meaningless.

1.3.1.2 The Class of Semicomputable Semimeasures It was L.A. Levin who put the general idea on a more secure footing, after learning of Solomonoff’s paper from Kolmogorov. Levin made an in-depth investigation of the mathematical properties of the class M of lower semicomputable semimeasures [191, 116].

Measures and semimeasures For any given finite string σ, one can define its cylinder set as the set Γσ = {S | σ ≺ S} of all its infinite extensions. Then I will call a (probability) measure <ω a function µ : G → R from the collection of cylinder sets G = {Γσ | σ ∈ 2 } to the real numbers that satisfies

µ(Γ) = 1,

µ(Γσ) = µ(Γσ0) + µ(Γσ1).

These conditions can naturally be interpreted as expressing that the probability that an infinite string S is an extension of the empty string (i.e., is in the class of all infinite strings) equals 1, and that the probability that S is an extension of σ is the probability that it is

31 1. Solomonoff’s Theory of Prediction either an extension of σ0 or σ1. If one identifies the cylinders with their defining strings, one simply gets

µ() = 1, µ(σ) = µ(σ0) + µ(σ1).

For the purpose of this thesis, then, a measure µ is really just a probability distribution P over finite strings that satisfies the natural restrictions that P () = 1 and the subadditivity property that P (σ) = P (σ0) + P (σ1). A semimeasure µ : G → R only satisfies

µ(Γ) ≤ 1,

µ(Γσ) ≤ µ(Γσ0) + µ(Γσ1).

One can view a strict semimeasure (where some of the above inequalities are strict) as a “defective” probability distribution [50].

Computability of real-valued functions The functions that a Turing machine computes, that I presented as functions on finite bit strings, are really (partial) functions on any countable domain. After all, we can easily define an encoding of a countably infinite set of discrete elements onto finite bit strings. In this way we can treat the computable functions as functions from finite bit strings to natural numbers, or even from rational numbers to rational numbers. What about functions that return real numbers? Real numbers really correspond to infinite binary sequences – simply put them in binary (rather than decimal) base. Much like one can construct the real numbers as Cauchy sequences of rational numbers, one can computably approximate a real-valued function by rational-valued (i.e., “finite string- <ω valued”) functions. In particular, function f : 2 → R is computably approximated from <ω below by computable φ : 2 × N → Q if φ(σ, n) ≥ φ(σ, n + 1) and limn→∞ φ(σ, n) = f(σ). For any given value function φ approaches f ever closer, but one can never tell how far it’s still removed. If f can be approximated in this way, it is lower semicomputable. Similarly, a function f may be approximated from above by a computable φ, in which case it is upper semicomputable.A semicomputable function is lower or upper semicomputable. A function that is both upper and lower semicomputable can be nailed down to any degree of precision: it is computable.8 The foregoing definitions of (semi)computability transfer directly to measures and semimea- sures, that are also functions from finite strings to reals.

1.3.1.3 A Universal Element Again notions of computability provide a natural and very unrestrictive way of coming to grips with concepts of a generality that would simply be too staggering if not tamed at all – in this case a class of all feasible probability distributions. The reason for the weakening to semimeasures that are only semicomputable is a feature that is lacked by the subclass Mmeas of computable measures (that coincides with the class of semi- computable measures, because any semicomputable measure must be computable [119,

8Here I follow the presentation of [119]. In different contexts, (lower) semicomputable functions are often called (left-) computably enumerable (c.e.) functions (also see [46, ch. 5]). Sometimes they are simply called enumerable (e.g., [81]).

32 1.3. The Universal Prior Distribution

Lemma 4.5.1]) and by the subclass of Mcomp of computable semimeasures. Namely, the class M of semicomputable semimeasures contains formally universal members, elements in the same class that dominate all others. Definition 1.4 (Universality of semicomputable semimeasures [191]). Semicomputable semimeasure ν is formally universal if for every semimeasure µ ∈ M there is a single independent constant cµ with ν(σ) ≥ cµµ(σ) for all bit strings σ. (I refer to this defining dominance property as formal universality, to distinguish it from informal conceptions of universality in later discussions.) A particularly straightforward way of constructing such a universal element ex- ploits the fact that we can effectively enumerate all semicomputable semimeasures µ0, µ1, µ2, .... This allows us to define, in a way that is itself lower semicomputable, a weighted sum of the probabilities given by all measures. For the lower semicomputable weight function w, that gives a prior weight to each element of M, it’s only required P that w(µi) > 0 for all i and that the sum i w(µi) doesn’t exceed 1. This leads to a: Definition 1.5 (Universal Mixture Distribution [191]). X ξw(σ) := w(µi)µi(σ). i Stipulating a “reference” function w again permits us to speak about the universal mixture distribution. This universal mixture is an example of a (defective) distribution that is formally universal: for every µ ∈ M there is a constant cµ = w(µ) such that ξ(σ) ≥ cµµ(σ) for all σ. On a more intuitive level, the bounds of formal universality force our ξ to levelly spread out its probability mass over all possibilities, keeping a close distance to each of them.

Incomputability of universal elements Let’s refer to the class of all elements that satisfy Definition 1.4 as the class U of universal elements of M. The definition of universal mixture specifies a specific subclass Uξ ⊆ U ⊆ M of universal elements of the class of semicomputable semimeasures. By definition, all the elements of M, including the universal elements, are semi- computable. But the universal elements of M are certainly not computable. This follows most directly from the fact, mentioned above, that the subclass of computable continuous semimeasures has no universal element [119, Lemma 4.5.2]. Now any uni- versal ν ∈ U certainly dominates all elements in the subclass of computable continous semimeasures: if it were computable, it would be an element, and so a universal ele- ment, of this subclass. But there exists no such element, hence ν can’t be computable. Figure 1.1 depicts some of the previous facts about M in a Venn diagram.

1.3.2 Algorithmic Probability and The Universal Mixture Interestingly, the definitions of algorithmic probability and the universal mixture dis- tribution turn out to be very much related.

33 1. Solomonoff’s Theory of Prediction

Mmeas ξ U

Mcomp

M

Figure 1.1: The class M and the relation between subclasses Mmeas, Mcomp and U.

1.3.2.1 Algorithmic Probability as a Member of M Recall that, due to Kraft’s inequality, the definition of algorithmic probability based on P a prefix universal machine yields a total sum σ Q(σ) of probabilities that’s bounded by 1. Moreover, closer inspection of the definition shows that the sum of the proba- bilities Q(σ0) and Q(σ1) of the one-bit extensions of σ is bounded by the probability Q(σ) of σ. Hence Q conforms to the conditions of a semimeasure. Furthermore, one can effectively approximate Q from below. Given a string σ, simply run the reference machine in dovetailing fashion9 on all possible inputs and put τ in auxiliary set Dσ if the machine outputs σ on reading the input up to τ, and τ is minimal so far (in case there are extensions present already, take them out). At any time, the sum P 2−|τ| for provisional D will be bounded by Q(σ), never τ∈Dσ σ decrease (even if extensions are replaced by a shorter minimal description), and reach Q(σ) in the limit. Hence Q is lower semicomputable.

Not a measure, not computable Is Q in fact a proper measure? It’s not, and the reason for that is the circumstance that the reference machine doesn’t converge on all inputs. For some infinite sequences on the input tape it will never halt. Since one could create a new and larger prefix-free set of descriptions by adding to the reference machine’s domain initial segments of these “invalid” sequences (descriptions for “no output”, as it were), and still remain within the bounds of Kraft’s inequality, the sum of the original probabilities must be strictly below 1. Hence Q is a strict semimeasure. Moreover, Q is certainly not computable simpliciter. If it were, then it would also be possible to compute the length of the shortest U-descriptions for each element. To

9Let the machine first execute one instruction on the first input, then one instruction on the second input, then two instructions on the first input and two on the second and two on the third, and then three on the first... and so on. This method of dovetailing guarantees that every convergent computation will be completed even if the machine encounters divergent inputs.

34 1.3. The Universal Prior Distribution see this, recall that we can computably enumerate all descriptions of given σ. Now at some point we are sure to know that there can exist no shorter description for σ than the shortest one we have found so far, because an even shorter description would increase the algorithmic probability of σ beyond the value that, we assume, we already have been able to compute. The length of the shortest description we have found in this way is actually the monotone Kolmogorov complexity Km(σ) of string σ, the version of Kolmogorov complexity based on monotone machines. However, monotone Kolmogorov complexity, like prefix Kolmogorov complexity, is incomputable – and so, by contradiction, algorithmic probability must be incomputable as well.

Monotone machines and semicomputable semimeasures As a matter of fact, it can be shown that the class of semicomputable semimeasures M is precisely given by the class of semimeasures defined with the help of a monotone machine (that doesn’t have to be universal) as in the definition of algorithmic probability (1.2). That is, for every monotone machine M the distribution

X −|τ| µM (σ) = 2 ,

τ∈Dσ

0 with Dσ = {τ | M(τ) < σ} the corresponding set of minimal descriptions of σ, is in fact a semicomputable semimeasure, a member of M. What’s more, all members of M are obtainable in this way [191].

1.3.2.2 Asymptotic Equivalence of the Two Definitions The fact that all universal elements are asymptotically equivalent, together with the fact that algorithmic probability defines a universal element, leads to the asymptotic equivalence with the definition of the universal mixture distribution.

Equivalence of universal elements The choice of weight function w in Definition 1.5 is of no consequence for the dominance properties of ξ. It follows that two universal elements of M, specified in the same way with the help of different semicomputable weight functions, dominate each other. This implies, similar to the case of Q and its nonessential reliance on the choice of a reference machine, a specific invariance to the choice of weight function. For if, suppose, ξw were defined with weight function w, and ξv with function v, w then (and I take ξv = µj ∈ M and constant cv = w(ξv))

X w ξw(σ) = w(µi)µi(σ) ≥ w(µj)µj(σ) = w(ξv)ξv(σ) = cv ξv(σ) i

v v for all strings σ. Likewise, there is a constant cw such that ξv(σ) ≥ cwξw for all σ. In × short, ξw = ξv. In effect, Definition 1.5 illustrates just one way of constructing a semicomputable semimeasure that fits the requirement of Definition 1.4. Clearly, the relevant trait

35 1. Solomonoff’s Theory of Prediction for asymptotic equivalence is the abstract property of formal universality, the prop- erty of dominance. In general, for every pair ν and ν0 of universal semicomputable semimeasures, ν =× ν0. Thus all elements of U are asymptotically equivalent.

Algorithmic Probability as Universal Element It’s not hard to verify that the semi- measure Q = µU , based on a reference universal machine that emulates any other machine, dominates all other µM . That means that Q is a universal element in the class M of semicomputable semimeasures. To be a bit more precise, every definition of algorithmic probability QU via a specific reference universal machine U specifies a universal element of M. I will denote this subclass of the class U of universal elements by UQ. The universality of Q entails the (on the face of it) remarkable asymptotic equality of the algorithmic probability Q and the universal mixture distribution ξ.

Theorem 1.6 (Asymptotic equivalence Q and ξ [191]). There exists a constant c such that for each σ ∈ 2<ω: Q(σ) = ξ(σ) up to multiplication by c.

That is, Q(σ) =× ξ(σ). Equivalently, − log Q(σ) =+ − log ξ(σ). In more precise terms this means that every element of UQ is asymptotically equivalent to every element of Uξ:

× QU (σ) = ξw(σ) for every universal machine U and semicomputable weight function w.

1.3.2.3 The Universal Prior Distribution The equivalence of both definitions gives us ground to take the convenient step of pretending that we have one unique universal distribution.

The class of universal semicomputable semimeasures The two definitions of algo- rithmic probability and the universal mixture distribution are equivalent in yet an- other strong sense, as was very recently shown by I. Wood, P. Sunehag and M. Hutter [188]. Namely, any universal semimeasure defined as the algorithmic probability via some universal U is equal (not just asymptotically equivalent) to the universal mixture distribution via some semicomputable weight function, and vice versa. That is, both definitions give rise to the exact same class of universal semicomputable semimeasures.

Theorem 1.7 (Relation definitions Q and ξ [188]). The class UQ of all algorithmic probability distributions QU coincides with the class Uξ of all universal mixture distri- butions ξw:

36 1.3. The Universal Prior Distribution

Uξ = UQ M

U

Mincomp

Msemimeas

M

Figure 1.2: The constitution of the class M of semicomputable semimeasures. Msemimeas = M\Mmeas represents the subclass of strict semimeasures and Mincomp = M\Mcomp the subclass of incomputable elements. All elements in the shaded area are asymptotically equivalent.

UQ = Uξ. In the same paper, the authors show that not all universal semicomputable semimea- sures are of this form: the class U strictly includes the class UQ = Uξ.

Theorem 1.8 (Relation U and UQ = Uξ [188]). There exist universal semicomputable semimeasures that are not given by the universal probability via some universal machine U (equivalently, the universal mixture via some semicomputable weight function w):

UQ = Uξ ( U.

The picture that emerges from these results is given in Figure 1.2.

The universal prior distribution There is as yet little known about the properties of universal elements outside the class UQ = Uξ, and in what respects they might differ from elements that can be defined as universal mixture or algorithmic probability distributions [M. Hutter, personal correspondence]. In all of the following, I will restrict attention to those universal elements that are covered by the two equivalent definitions. Within this class, the equivalence of Theorem 1.7 tells us that the selection of a universal mixture ξw by the choice of a reference weight function w comes down to the selection of a reference universal monotone machine U to define an algorithmic

37 1. Solomonoff’s Theory of Prediction

probability distribution QU , and the other way around. In the upcoming discussions, it will be convenient to suppose that such a selection has been made, resulting in a unique distribution ξw = QU that is denoted M. Alternatively, in light of the asymptotic equivalence of all of these distributions, it sometimes makes practical sense to view all elements of UQ = Uξ as essentially representing one single distribution, again denoted M. In both ways of looking at things, I will refer to this distribution M as the universal prior distribution. At other times, it’s important to stress the fact that in reality there is an infi- nite class UQ = Uξ of asymptotically equivalent predictors. Overloading the previous terminology, I baptise this class UM, the class of universal prior distributions.

1.4 Universal Prediction

The problem that we started with was to predict the continuation of a binary string. I conclude this Chapter with the solution according to Solomonoff’s theory of Prediction.

1.4.1 Prediction with Solomonoff One point of departure was that the developing history is governed by a probability distribution. The other is that the development is effective. More precisely, putting things in Levin’s framework, the developing string is governed by a computable mea- sure. Since an effective measures is supposed to give a complete specification of the bounds on the data generation process, I will often figuratively use the term environ- ment to refer to any such measure. We suppose there is always an underlying “true” or actual measure µ ∈ Mmeas at work in generating the string in question. Falling back on Bayes’ rule (1.3), the actual probability of τ after having seen σ would then be

µ(στ) µ(τ | σ) = . (1.9) µ(σ) Of course, the problem is that we can’t assume to know the actual environment µ. Therefore, instead of the actual µ, we use the universal prior distribution M. Replacing the unknown µ with M, we estimate the probability of continuation τ of σ as

M(στ) M(τ | σ) = . (1.10) M(σ)

Predictors A natural way of looking at the universal prior distribution is as a uni- versal method of prediction – the universal predictor M. In most part of the thesis, I refer to M in this way. Again, where this can’t lead to trouble, I will often follow custom in pretending there is a single such predictor, rather than an infinite class UM of asymptotically equivalent ones. The unknown environment µ could also be seen as representing the most “informed” prediction method available – we could call it the informed predictor. However, with

38 1.4. Universal Prediction the important exception of Chapter4, Section 4.4, I will stick to the customary inter- pretation as an actual generation mechanism of the data, an actual environment.

Measures The class M that constitutes the formal framework consists of all semi- computable semimeasures, and the universal predictor in the form of the universal mixture distribution takes every such element into account. Yet we limit ourselves to the measures (for which semicomputability and computability coincide) as the class of possible actual distributions. This issue will receive further attention in Chapter4, Section 4.2, but I will say here that one important reason for the limitation to measures is the central result of the next Subsection.

1.4.2 Completeness So what is the rationale for this strategy, for always using the universal predictor M? There is, to begin with, the intuitive support of the ideas behind the definition of Q and ξ, and of the domination properties of the universal prior distribution as universal element. I will come to deal with the weight of these intuitions and underlying assumptions in detail in the upcoming Chapters. But there is more to offer than just the hopeful fancy that applying the universal prior instead of the actual distribution for prediction will lead to good results: it can be rigorously proven. With good results I mean that the predictions with M will very quickly be close to the probabilities given by the actual distribution. Our predictions will rapidly converge to the best possible predictions.

1.4.2.1 Convergence in Difference A suitable measure of the expected error of the n-th prediction when using M instead of the actual µ is given by X Ht(µ, M) := µ(σ)H(µ(· | σ), M(· | σ)), {σ:|σ|=t−1}

with X H(µ(· | σ), M(· | σ)) := (pµ(b | σ) − pM(b | σ))2 b∈{0,1}

the squared Hellinger distance between µ and M (conditional on σ). The following Theorem states the convergence of the total sum, for all lengths t, of all expected errors after t bits. Theorem 1.9 (Convergence in difference [165]). For (computable) measure µ, the expected total prediction error when using M = ξw is X Ht(µ, M) ≤ − ln w(µ). t

39 1. Solomonoff’s Theory of Prediction

Proof. The proof I give here is based on [119, Theorem 5.2.1]. I will actually show this bound for a different measure of expected error, namely

X Dt(µ k M) := µ(σ)D(µ(· | σ) k M(· | σ)), {σ:|σ|=t−1}

with

X µ(b | σ) D(µ(· | σ) k M(· | σ)) := µ(b | σ) ln M(b | σ) b∈{0,1}

the Kullback-Leibler divergence of M with respect to µ (conditional on σ).10 The original theorem then follows immediately because Ht ≤ Dt for all t (see [119, Lemma 5.2.1, Claim 5.2.1]).

Ps We can derive the upper bound − ln w(µ) on the sum t=1 Dt as follows.

10The Kullback-Leibler divergence, also known as relative entropy, is an information-theoretic way of quantifying the error of one distribution from the perspective of another. Unlike the Hellinger distance, the Kullback-Leibler divergence is not a metric, which is sometimes taken as cause to prefer the Hellinger distance as a measure of difference between distributions.

40 1.4. Universal Prediction

s n X (a) X X X µ(b | σ) D (µ k M) = µ(σ) µ(b | σ) ln t M(b | σ) t=1 t=1 |σ|=t−1 b s (b) X X X µ(b | σ) = µ(σb) ln M(b | σ) t=1 |σ|=t−1 b s (c) X X µ(σt | σ

Step (a) is just the unfolding of the definition of Dt. Step (b) brings µ(σ) into the sum and then uses µ(σb) = µ(σ)µ(b | σ). In step (c), we merge the summation over single extensions b by summing over strings σ that are one bit longer. Here I use the shorthand σt to denote σ(t), the t-th bit of σ, and σ

41 1. Solomonoff’s Theory of Prediction to derive to inequality of step (g). The resulting term is then further simplified via (h) and (using that the sum of probabilities of strings of the same length is bounded by 1) via (i), finally resulting in the bound − ln w(µ).

Thus the total expected sum of prediction errors in the limit is bounded by a small constant k = − ln w(µ), that only depends on the weight of µ (via the weight function w used in defining ξw). The convergence of the total sum implies that, in the limit of P∞ t, the expected error Ht after t bits goes to 0. Even better, the tail sum t=s Ht also converges to 0 as s goes to the limit. The Theorem entails that the universal predictor M can be expected (with proba- bility 1) to quickly converge to any unknown distribution.

Corollary 1.10. For any (computable) measure µ,

t→∞ M(b | σ t) − µ(b | σ t) −→ 0 with µ-probability 1.

Convergence in Ratio The previous result doesn’t exclude the possibility that the ra- tio between the values M(b | σ) and µ(b | σ) keeps on fluctuating and never converges. The following theorem addresses this worry.

Theorem 1.11 (Convergence in ratio [P. G´acs],see also [119, 81]). For (computable) measure µ and infinite sequence S ∈ 2ω,

$$\frac{M(S(t) \mid S_{<t})}{\mu(S(t) \mid S_{<t})} \;\xrightarrow{\,t\to\infty\,}\; 1 \quad\text{with } \mu\text{-probability 1.}$$

This result only concerns on-sequence convergence, keeping to the probabilities of bits S(t) on the sequence S. For off-sequence prediction in general, the additional requirement is needed that all conditional µ-probabilities stay at least some fixed constant away from 0.
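To get a concrete feel for these convergence results, here is a minimal simulation sketch. It uses a Bayes mixture over a small, hypothetical class of Bernoulli measures as a stand-in for the full universal mixture ξ_w (the class, the uniform prior weights and the true bias are illustrative assumptions, not part of Solomonoff's construction); the mixture's conditional predictions approach those of the true measure both in difference and in ratio, and the accumulated conditional Kullback-Leibler error, whose µ-expectation Theorem 1.9 bounds by −ln w(µ), stays small.

import math
import random

# Toy stand-in for the universal mixture: a Bayes mixture over a small,
# hypothetical class of Bernoulli(p) measures with uniform prior weights w.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = {p: 1.0 / len(hypotheses) for p in hypotheses}

true_p = 0.7                      # the "actual" computable measure mu
random.seed(0)

posterior = dict(prior)
kl_total = 0.0
for t in range(1, 201):
    # Predictive probability of a '1' for the mixture and for mu itself
    m_pred = sum(posterior[p] * p for p in hypotheses)
    mu_pred = true_p
    # Accumulate D(mu(.|prefix) || M(.|prefix)); Theorem 1.9 bounds the
    # mu-expectation of this total by -ln w(mu) = ln 5 here.
    kl_total += (mu_pred * math.log(mu_pred / m_pred)
                 + (1 - mu_pred) * math.log((1 - mu_pred) / (1 - m_pred)))
    if t in (1, 10, 50, 200):
        print(f"t={t:3d}  |M-mu|={abs(m_pred - mu_pred):.4f}  "
              f"M/mu={m_pred / mu_pred:.4f}")
    # Sample the next bit from mu and update the posterior weights (Bayes)
    bit = 1 if random.random() < true_p else 0
    for p in hypotheses:
        posterior[p] *= p if bit == 1 else (1 - p)
    norm = sum(posterior.values())
    for p in hypotheses:
        posterior[p] /= norm

print(f"accumulated KL error: {kl_total:.3f}  (bound in expectation: {math.log(5):.3f})")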

Further convergence and optimality results Over the last decade, Hutter has further refined these convergence results. He generalized them to arbitrary finite alphabets [76] and added some facts about the speed of convergence [78, 86]. In addition, he derived a tight upper bound on the loss suffered by the universal predictor in comparison to the (optimal) informed predictor for a wide variety of loss functions [77, 78]. Moreover, he showed that the universal predictor is Pareto-optimal, meaning that no other predictor performs at least as well in every environment and strictly better in some [79].

1.4.2.2 Completeness

The result that the universal predictor quickly converges to any unknown distribution is what Solomonoff himself brands the completeness of his method. He presents it as standing in an inverse relation to computability: if "probability evaluation methods"

are "complete, and hence 'uncomputable', they have unknown error size. If they are computable, then they must be incomplete." [167, p. 474]

This Section's results provide support for the power of M as a method of prediction, converging quickly to any computable measure in M. At the same time, M is itself a semimeasure in M: the universal prior distribution. As the name suggests, it could equivalently be viewed as a candidate distribution for assigning the priors, supported by the same results. Thus Li and Vitányi conclude: "The problem with Bayes' rule has always been the determination of the prior. Using M universally gets rid of this problem and is provably perfect." [119, p. 361]


2 The Principles of Solomonoff Prediction

In the presentation of Li and Vitányi, the theory of Solomonoff arises from a small number of core ideas or principles:

"Essentially, combining the ideas of Epicurus, Ockham, Bayes and modern computability theory, Solomonoff has successfully invented a perfect theory of induction. It incorporates Epicurus's multiple explanations idea, since no hypothesis that is still consistent with the data will be eliminated. It incorporates Occam's simplest explanation idea, since the hypotheses with low Kolmogorov complexity are more probable. The inductive reasoning is performed by means of the mathematically sound rule of Bayes." [119, p. 347]

The overview of the previous Chapter was mainly aimed at the formal aspects of Solomonoff's theory of Prediction. In this Chapter, I shift focus to the proper way of looking at the formal theory, of interpreting the theory. This, firstly, requires a consideration of what we actually take to be the purpose or aim of the theory, which will give us a clearer view of what we should require of it (Section 2.1). This discussion, secondly, motivates and supports the identification of core principles that can be said to underlie the theory (Section 2.2). Although it is initiated by the proposed main ideas in the above quotation, the discussion of the role and status of the candidate core principles will lead me to take a critical view on the way SP is often presented in the literature. This clears the way for an alternative interpretation in terms of the one core principle of Universality.

2.1 The Purpose of Solomonoff Prediction

In order to get started, we have to agree on what we actually require of Solomonoff Prediction. What is it supposed to give us, by what standards should we judge it? It is the question of the purpose of Solomonoff's theory and its resulting goals that initiates our quest in Section 2.2 for the core principles of SP. In this Section, I will first distinguish three ways of interpreting a project like Solomonoff's (Subsection 2.1.1): as a method, as a model, and as a theory. The method interpretation, centered on the universal predictor, is perhaps the dominant view (Subsection 2.1.2). The model interpretation revolves around the issue of the connection of the framework with actual predictive problems (Subsection 2.1.3). I will reject these


first two interpretations in favour of the theory interpretation, that will finally lead me to point out the relevance of identifying the core principles of SP (Subsection 2.1.4).

2.1.1 Method, Model and Theory

As an illustrative prelude to the discussion of the purpose of Solomonoff Prediction, let us first examine in some detail the aims of the more general Bayesian theory of scientific inference (Chapter 1, Subsubsection 1.1.2.2).

2.1.1.1 Bayesian Confirmation as a Theory of Scientific Inference

Bayesian confirmation theory is adopted in the philosophy of science as providing a theory of scientific reasoning. The question to ask is: what should such a theory give us?

Method At first glance, the Bayesian aim is to offer us a guideline of how rational inference is to be done. The procedure of Bayesian reasoning is just that: a procedure, a method. And rather than a descriptive portrayal of a working scientist's actual reasoning, it's a method with a normative flavour: a method of optimally rational, ideal reasoning. It must in its pure form be an idealization and not in any sense a practical method, one is forced to admit, since in reality already the computational power required to meet the probability axioms (e.g., recognizing all tautologies) is beyond anyone's reach. This raises the obvious question about the use of a seemingly impossible prescription without guidelines on how to actually act upon it.

Model There exists a broader interpretation. Indeed, the attraction of the Bayesian approach stems from its success, as a particularly simple yet precise paradigm based on plausible and very mild assumptions, of embedding and defusing a respectable number of practices and problems in (the philosophy of) science; a success that no competitor can boast of. In this guise, Bayesianism does have a descriptive ring: it provides a model of scientific practice and problems. It offers a way of representing a large part of the scientific enterprise at a level of abstraction that often makes us see things more clearly. Then the level of idealization is directly related to the level of abstraction that is required to actually accommodate a real-world problem. And the model can be seen to incorporate the method, again as a sufficiently abstract depiction of reasoning. Still, the bare fact of a model that can accommodate whatever we want is perhaps not that interesting. In the Bayesian model, many particular inferences could be taken apart and accommodated in the model by an ad hoc choice of degrees of belief, leading to the worry that "the Bayesian apparatus is just a kind of tally device used to represent a more fundamental sort of reasoning whose essence does not lie in the assignment of little numbers to propositions in accord with the probability axioms" [47, p. 59]. As C. Glymour puts it in a related criticism [60], Bayesian inference then rather resembles a theory of personal learning: but showing how I arrive at certain beliefs from prior probabilities doesn't compel anyone else to the same reasoning –

that is, provides no argument (see Chapter 3, Subsubsection 3.1.1.1 for an extension of this critique that is relevant to SP).

Theory The above concerns have to do with the explanatory value of the model. Bayesian confirmation theory has to address this issue by going beyond just the model. As a theory, it consists of statements about the model. These statements concern what the model can tell us, and may also include justifications of the rationality and objectivity of the model, and of the method that the model incorporates. Showing how I arrive at certain beliefs from prior probabilities could turn out to provide an argument if the constraints on the prior probabilities can be seen to stem from more fundamental principles, and these ought to be found in the complete theory. In short, the theory specifies the conclusions that can be drawn from the model and method, and, ideally, also suggests why.

Alternative interpretations The final aim of the theory of Bayesian confirmation as a whole would be no less than a completely general and unified explanation of scientific method. Unfortunately, the great amount of idealization involved, and also the relative failure of the theory's a priori arguments for the objectivity and rationality of the model and method, advocate caution in accepting the theory as a full explanation of scientific inference. As a consequence, much effort has been directed at adjusting the theory towards a trade-off between generality and applicability. In a way, this comes down to finding different ways of looking at the same body of results. If one opts to aim for more modest goals than a complete and systematic explanation of scientific method, one could put the focus back on restricting the reach of the method and/or the model. W.C. Salmon [148], for instance, suggests (following Kuhn [109]) that actual science is about choosing between a theory and its rivals – thus the issue becomes how well a hypothesis is confirmed by a piece of evidence in comparison with its rival(s). In his Bayesian algorithm for theory preference, evidently a proper method that is to model this way of doing science, we only have to calculate the ratio of the posterior probabilities of two rival theories, so that, conveniently, the troublesome expectedness of the evidence cancels out. Naturally, this relative confirmation does mean that we lose all absolute estimates of the degree of belief in hypotheses. A very different approach, following P. Horwich's therapeutic or Wittgensteinian Bayesianism [72], is to take a view of Bayesianism as what he calls a philosophical theory that is primarily aimed at solving puzzles that arise from reflecting on scientific method. The more fruitful approach, Horwich argues, is a problem-oriented rather than a theory-oriented philosophy of science, where we are satisfied with dissolving paradoxes by unravelling misguided thinking; as such, Bayesianism is an ideal compromise between accuracy and simplicity, and stubbornly judging it by the criteria of a complete scientific theory would be "misplaced scientism". The previous examples show that there are multiple different ways of making sense of the same theory. Much the same applies to the theory of Solomonoff, as we will see.
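To spell out the cancellation that Salmon exploits (a routine application of Bayes' rule to two rival theories T1, T2 and evidence E):

$$\frac{P(T_1 \mid E)}{P(T_2 \mid E)} \;=\; \frac{P(E \mid T_1)\,P(T_1)\,/\,P(E)}{P(E \mid T_2)\,P(T_2)\,/\,P(E)} \;=\; \frac{P(E \mid T_1)\,P(T_1)}{P(E \mid T_2)\,P(T_2)},$$

so the expectedness P(E) of the evidence drops out, and only the rivals' likelihoods and priors are needed.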


2.1.1.2 Method, Model or Theory

The general moral I would like to draw from the previous discussion of the nature of Bayesian confirmation theory is the loose but useful distinction between a method, a model, and a theory.

◦ By a method I mean a well-defined procedure or strategy that applies in a preconceived setting. Roughly, a method describes or prescribes how things can or should be done.

◦ By a model I mean a description or depiction of a setting that is supposed to represent part of reality, at some level of abstraction. One could define methods to apply in this setting. Roughly, a model captures what we think is relevant.

◦ By a theory I mean an overarching collection of statements that are supposed to have explanatory value. These statements are likely to concern a model that is adopted by the theory. They could express conclusions drawn from the model, and reasons for its validity. Roughly, a theory ought not only to tell us what is the case, but also why this is so.

The method, model and theory of Solomonoff Prediction It will be very helpful to also consider the purpose of SP from the perspective of this schema. In the upcoming Subsections, I will consider interpretations of SP as a method of prediction (Subsection 2.1.2), as a model of prediction (Subsection 2.1.3), and as a theory of prediction (Subsection 2.1.4), respectively.

2.1.2 The Method of Solomonoff Prediction

Looking back at the presentation in Chapter 1, Section 1.4, the essence of Solomonoff's theory seems to be the guideline to use the universal predictor M in all cases. On this method interpretation of SP, the whole endeavor revolves around the specification of a particular method of prediction.

2.1.2.1 A Ready-To-Apply Method of Prediction

The first thing to get out of the way, then, is the issue of the practical use of this method. Even presupposing the restricted setting of effective binary sequence prediction, can it be directly implemented to do actual prediction? The answer shouldn't come as a surprise: it is not, and it cannot be.

Incomputability Much has been made of the fact that universal predictor M is formally incomputable (see Chapter 1, Section 1.3.2): there can be no algorithm that calculates the M-probabilities of bit strings. This appears to be a clear-cut mathematical result that immediately dashes any hope of practical application. I have two comments on this rigid reading. First, semimeasure M is still lower semicomputable. We can, in principle, close in on its values from below to any degree

of accuracy, although the incomputability implies that we can never know how close we are. Second, in practice, of course, it doesn't really matter whether a predictor is completely infeasible to calculate or provably incomputable; in both cases, the only way out is some approximation.

2.1.2.2 An Ideal Method of Prediction

Notwithstanding the fact that the universal predictor can never be applied in practice, it's customary in the literature to assess Solomonoff's theory as an actual method of prediction. It's viewed as an idealized method, a supposedly optimal but ethereal procedure that sets a paradigm for practical methods. In Solomonoff's own words, "algorithmic probability can serve as a kind of 'gold standard' for induction systems" [169, p. 83]. Again, an ideal that other methods can try to approximate.

Naive approximations The most straightforward approximation simply puts a limit on the computation time, or number of computation steps, that we spend on calculating M-probabilities. We use the value that we have when time is up. To illustrate, consider a resource-bounded version of algorithmic probability

$$Q^t(\sigma) := \sum_{\tau \in T^t_\sigma} 2^{-|\tau|},$$

with $T^t_\sigma$ the time-bounded analogue of $T_\sigma$: the set of minimal descriptions τ of σ found by the bounded computations $U^t(\tau)$, where $U^t(\tau)$ means that U will halt after t(|σ|) steps at the latest, for some function t. This method of keeping the infeasibility of M in check was already hinted at in Solomonoff's 1964 paper; some versions (also memory-bounded) were first formalized in detail in [185]. Solomonoff has high hopes for its use as a "general solution to the problem of practical induction" [168, p. 1]. I find it hard to see how we are much better off, for all practical purposes, with this solution. Apart from the inelegance of subjecting the computation to some arbitrary preconceived upper bound, this method of calculation is still as brute-force and inefficient as ever: we just give up at some point. The longer we can suspend this point, the closer we get to the optimal value given by M; but we would still need a completely infeasible amount of resources to get anywhere close – and never know how close.
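As a deliberately non-universal illustration of such a bounded approximation, the sketch below enumerates programs for a toy monotone machine (a made-up stand-in for U, in which a program of the form 1^k 0 p encodes a k-bit pattern p that gets repeated) and sums 2^{-|τ|} over the minimal programs that produce σ within given length and step budgets. The machine, its encoding, the helper names toy_machine and q_bounded, and the bounds are all illustrative assumptions, not Solomonoff's construction.

from itertools import product

def toy_machine(program: str, max_steps: int):
    """Toy monotone 'machine' standing in for U: a valid program has the
    form 1^k 0 p with a k-bit pattern p; its output is p repeated.
    Returns the first max_steps output bits, or None for invalid programs."""
    k = 0
    while k < len(program) and program[k] == "1":
        k += 1
    if k == 0 or len(program) != 2 * k + 1:   # need '0' separator and k pattern bits
        return None
    pattern = program[k + 1:]
    return "".join(pattern[i % k] for i in range(max_steps))

def q_bounded(sigma: str, max_len: int, max_steps: int) -> float:
    """Bounded analogue of algorithmic probability: sum 2^{-|tau|} over the
    minimal programs tau (of length <= max_len) whose output within
    max_steps steps starts with sigma."""
    total, minimal = 0.0, []
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            tau = "".join(bits)
            if any(tau.startswith(p) for p in minimal):
                continue                       # not minimal: extends a found program
            out = toy_machine(tau, max_steps)
            if out is not None and out.startswith(sigma):
                minimal.append(tau)
                total += 2.0 ** (-length)
    return total

# A highly regular string receives more (toy) algorithmic probability
# than an irregular one of the same length under the same bounds.
print(q_bounded("01010101", max_len=12, max_steps=16))   # 2^-5 + 2^-9
print(q_bounded("01101001", max_len=12, max_steps=16))   # 0.0 under these bounds

The example also makes the complaint in the main text tangible: even for eight-bit strings, the enumeration is brute-force, and raising the bounds only postpones the point at which we give up.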

Beyond method Anticipating more extensive discussion in Chapter 3, Section 3.1 on the principle of Completeness, I think that the description of an idealized method, in isolation, has little use. We need more context to make sense of the virtue of a method that is not implementable and that also can't be straightforwardly approximated. This context is provided by an actual theory that embeds this method. As we will see, the dominant inclination to keep to the level of this method interpretation, focussing on problems that would only arise if we try to actually employ the method, gives rise to unnecessary problems (see, as a prime example, Chapter 3, Section 3.2).


2.1.3 The Model of Solomonoff Prediction

The universal prediction method M is embedded in the precise mathematical framework that was developed in the previous Chapter. This mathematical setting, as a representation of predictive problems and the process of predicting, is the model that the theory rests on. On the model interpretation of SP, the virtue of the project is just this provision of an abstract model of prediction.

2.1.3.1 The Mathematical Framework

The mathematical setting is given by the class M of semicomputable semimeasures. One element from its subclass Mmeas of computable measures is supposed to govern a developing binary sequence. The continuation of this sequence may be predicted by the method that is given by a universal element in UM ⊂ M. The choice for an information-theoretic (Chapter 1, Subsection 1.1.1), probabilistic (Chapter 1, Subsection 1.1.2) and effective (Chapter 1, Subsection 1.1.3) setting (that is subsequently formalized as involving semimeasures that are semicomputable) is the point of departure for the model and SP in general. It forms a starting assumption without which the theory wouldn't get off the ground – though I will show how it derives from a more fundamental consideration in Chapter 4.
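For reference, the defining conditions, in the standard formulation (cf. [119]), are that a semimeasure ν on binary strings satisfies

$$\nu(\epsilon) \le 1, \qquad \nu(\sigma) \ge \nu(\sigma 0) + \nu(\sigma 1) \quad \text{for all } \sigma;$$

it is a measure when both hold with equality, and it is semicomputable when its values can be effectively approximated from below.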

Some intuition A vivid way of looking at the mathematical framework is by means of binary trees. Each (semi)measure µ corresponds to a binary tree where the root node represents the empty string, and every node branches into two nodes that represent the two possible one-bit extensions σ0 and σ1 of the string σ represented by the original node. The edge leading from a string to a one-bit extension σb is marked by the probability value µ(σb) of this extension, according to the semimeasure µ. Now the generation of a bit sequence that is governed by µ can be seen as the undertaking of a journey that departs from the root of the corresponding tree (let's call it the µ-tree), when all you have is an empty sequence. Then at each split you encounter, the probability values at the signposts next to the two roads determine the probability of you taking the left or the right turn. At each such turn, you add the corresponding bit to your sequence. The procedure of conditionalizing, assessing the relative µ-probabilities after having received string σ, comes down to taking the subtree with root σ from the µ-tree, and normalizing the probabilities at the edges of this subtree (in such a way that the probabilities at the edges from the root sum to 1). Naturally, one of these binary trees represents the universal predictor. (Or rather, each universal semimeasure is also represented by some such tree.) It has the remarkable property that as given data sequences get increasingly longer, the probability values of the normalized subtrees rooted at these data sequences increasingly agree with those of the corresponding normalized subtrees of the unknown actual µ-tree.
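A minimal sketch of this picture (assuming a measure is represented simply as a Python function from finite bit strings to probabilities; the Bernoulli example and the helper names are illustrative choices): conditionalizing on a received string σ amounts to renormalizing the one-bit extensions at the root of the subtree rooted at σ.

def bernoulli_measure(p):
    """mu(sigma) for a Bernoulli(p) source: the product of the bit probabilities."""
    def mu(sigma: str) -> float:
        prob = 1.0
        for bit in sigma:
            prob *= p if bit == "1" else (1 - p)
        return prob
    return mu

def conditional(mu, sigma: str):
    """Edge labels at node sigma of the normalized subtree:
    mu(sigma b) / mu(sigma) for the two one-bit extensions b."""
    return {b: mu(sigma + b) / mu(sigma) for b in "01"}

mu = bernoulli_measure(0.7)
print(conditional(mu, ""))     # at the root: {'0': 0.3, '1': 0.7}
print(conditional(mu, "101"))  # every subtree of a Bernoulli tree looks the same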

2.1.3.2 The Framework as a Model

The mathematical framework is, of course, highly artificial. It has all the looks of leading an isolated, Platonic existence. The framework can only be commended as an

actual model of prediction if it can be tied to the concrete world, if it can be seen to represent a relevant part of the predictive problems we are interested in.

Making the connection to reality Ultimately, we would hope that the framework can be shown to be able to capture predictive problems in the "real world". We would hope that binary sequences can serve as representations of the real-world phenomena that we investigate, and that effective (semi)measures can serve as representations of how they have developed and will develop in the future, and how this future is predicted. I will, indeed, argue in Chapter 4 that the framework is sufficiently unrestrictive to allow us to accommodate any real predictive problem, in principle. (Note that there is some lack of clarity about the dual role of (semi)measures as environments and as predictors. This issue will be addressed in Section 4.4 of Chapter 4.)

Beyond model However, the qualifier "in principle" is significant. The downside of the extreme generality of the framework is its extreme level of abstraction. This puts doubt on the suitability of actually modeling given predictive problems. It certainly looks utterly infeasible to take the trouble to translate a specific predictive problem into terms of binary sequences and effective semimeasures. Even if one persists and completes the job, at an appropriate level of abstraction, it's unclear how this new description would be very informative. What it gives is the opportunity to bring the universal predictor into the problem, but in light of the infeasibility of the method it's again unclear how that would be useful: not as a descriptive component because prediction wouldn't be done in that way, nor as a normative component because it could in fact never be done in that way. Clearly, the translation from an actual predictive problem to the mathematical framework simply involves too great an amount of idealization to be helpful with the problem at hand. This consideration defeats the purpose of SP as just providing a model of prediction, and hence takes down the model interpretation. The conclusion that it doesn't make sense to handle specific predictive problems via the model, however, doesn't mean that it's not interesting that any predictive problem could, in principle, be accommodated in the model. The interest then lies in what can be said about this extremely general model. This is the content of the encompassing theory of Solomonoff Prediction.

2.1.4 The Theory of Solomonoff Prediction

The theory of SP is about the idealized model and about the idealized universal method. What I will come to contend is that the essence of Solomonoff's theory of Prediction is the nontrivial result that in a universal framework, there exist universal predictors. What I mean by this, and why I think this is the essence, will become fully clear in the next Chapter on the principle of Universality. For now, the important thing is that Solomonoff Prediction should be seen as giving a theory of prediction. The ambitious label "theory" suggests a more sweeping status than "just" a method or "just" a model, but it's perhaps more accurate to see the theory interpretation as cautiously taking a step back. Much less exciting than the

unconditioned launch of a single all-purpose predictor, or an all-encompassing model of prediction, as the method and model interpretations have it, the above statement, the content of the theory, expresses a qualification of the significance of the model and method. As such, the theory interpretation of SP that I defend here is really a much more modest interpretation. The next step is to further explore what we hope to really gain by such an idealized theory – idealized because of the idealization involved in the model (and method) it incorporates. It's the answer to this question that reveals the relevance of the project of this Chapter: the establishment of the core principles of Solomonoff's theory.

2.1.4.1 The Use of an Idealized Theory

I would argue that, in general, the potential merits of an idealized theory are:

◦ allowing a more profound understanding of the relevant issues;

◦ providing a basis for approximation by, or inspiration to, more practically applicable theories.

These two points are very much connected.

Insight and understanding The least that can be said of an idealized theory like SP is that it touches on some fundamental philosophical problems. The further hope, that we may codify as a further aim, is that it allows us to find a new and clearer way of looking at elements of these problems, and to gain a more profound understanding of them. This goal is still a difficult thing to defend. Faced with a skeptic who dismisses work on extremely idealized theories of this kind as sterile academic occupational therapy, the burden of proof would lie on us. The obvious line of defence against such a skeptic would be the promise of future more practical use.1 If we take one step in that direction, we arrive at the following point.

Inspiration and approximation As an idealized theory, SP can still set a direction for more practically oriented research. Methods or principles of prediction could be developed that derive from the idealized, strictly unreachable, theory. Of course, the main example that I have in mind of such a theory is J.J. Rissanen’s principle of Minimum Description Length. This defence also brings to mind Solomonoff’s wording of a “gold standard” for induction systems. Then doesn’t it pertain to a method interpretation of SP again, as a specific method that others must try to approximate? It’s certainly a good guess that this is what Solomonoff had in mind. However, this way of viewing things can only be said to apply to the most naive approximations, the main class of which

1Though I believe there is an equally important yet harder to defend intrinsic value to (the development of) such theories, and so accepting the skeptic's criterion, de facto accepting to play by his rules, is rather like an admission of weakness. An argument to that effect, however, would take us too far here, and, anyway, the second point suffices for the current exposition.


I discussed and discarded in Subsection 2.1.2 above. Any other practically feasible theory or method will be sufficiently different from the ideal method that many of its aspects can only be said to be inspired by the original, rather than being actual approximations. Admittedly, these qualities are two ends on the same scale, on which every more down-to-earth theory that aspires to the idealized setting of SP has to find a place. But granted that most interesting theories (including MDL) will tip the balance to the "inspiration" end, the aspects of the idealized method that such a theory will derive inspiration from are not that it gives optimal predictions – it will be in aspects that elucidate why it gives optimal predictions. This is the burden of the theory that includes the method, and the core principles that underlie the theory.

2.1.4.2 The Relevance of Principles

The foregoing aims of an idealized theory directly point at the relevance of identifying its core principles.

The nature of principles Before asking about the use of core principles, it doesn't seem uncalled for to ask what "principles", as I use the term here, actually are. This must remain a bit vague. What I mean to refer to are the fundamental ideas behind the theory, or the theory's main higher-level aspects and characteristics. Principles, as such, would be invoked to give an intuitive impression of the theory, or to defend its value or validity. They indicate why the theory is cogent. In this broad description, principles could possibly take the role(s) of

◦ a supposition, or

◦ a consequence, or

◦ a justification of the theory.

Rather than trying to make this intuition more precise, and risk cutting off useful associations, I hope that in the course of this chapter it will become clear why it's natural and helpful to pick out some aspects of the theory and give them the special status of "principles".

The use of principles Naturally, identifying one or more basic principles that ground Solomonoff’s theory will give a better conceptual appreciation of the theory. But the interest of doing so goes beyond the theory itself. It’s also bound to contribute to the fulfillment of the theory’s potential as an ideal theory. First of all, a clarification of philosophical issues via the theory will only benefit from a recognition of the fundamental principles that the theory is founded on. Furthermore, rather than the technical details surrounding them, it will be the core principles of the ideal theory that serve as inspiration for theories expanding on it. This is the main justification for undertaking the project in this Chapter, and its relevance for any forthcoming discussion of successor theories.


2.2 The Candidate Core Principles

If we agree on its general interest, we can embark on the project of identifying the core principles of Solomonoff's theory. In this Section, I will first extract three candidate core principles from the main aspects of the theory (Subsection 2.2.1). These three principles, namely the principles of Completeness (Subsection 2.2.2), Simplicity (Subsection 2.2.3) and Universality (Subsection 2.2.4), I will then inspect in some more detail, laying the ground for a defence of Universality as the main principle in the next Chapters 3 and 4.

2.2.1 Identification of the Candidates

There are a number of general ideas that are often connected to the merits of the theory.

A list of names Recall the quote at the beginning of this chapter, recounting a couple of names that stand for the central ideas of SP. Hutter is more concise still:

“All you need for universal prediction is Ockham, Epicurus, Bayes, Solomonoff, Kolmogorov, and Turing.” [83, p. 45]

Let me connect the names with the ideas that they may represent:

◦ Ockham: the preference for simplicity;

◦ Epicurus: the principle of multiple explanations;

◦ Bayes: the probabilistic setting of confirmation theory (it seems only just to also include the name of Carnap here);

◦ Solomonoff: the completeness of the universal predictor;

◦ Kolmogorov: the identification of simplicity with shortest description length;

◦ Turing: the effective setting of computability theory.

I add to this list the name of Shannon, the father of information theory:

◦ Shannon: the information-theoretic language of bit strings.

The candidate core principles I will now distil from the previous list three candidate core principles, where all of the above names find their place.

◦ Universality. The first candidate main principle, which I will later argue is the single fundamental principle of the theory, is that of the Universality of the theory. I group the three starting assumptions behind the mathematical setting of SP (as mentioned in

Subsubsection 2.1.3.1), represented by Shannon, Bayes & Carnap and Turing, under this principle. This is because, as I will show later, their heuristic relevance is mainly to keep things as general as possible. The specific instantiation of Epicurus's principle of multiple explanations should be seen as an aspect of Universality too.

Figure 2.1: The candidate core principles (Universality: everything taken into consideration; Simplicity: bias towards simple hypotheses; Completeness: it works!) and their possible relations (justifies; implements/entails) to the mathematical framework: the class M of semicomputable semimeasures on bit strings, with universal elements.

◦ Simplicity. The second candidate main principle is that of Simplicity, often referred to as the principle of Ockham’s razor. The intuition behind Kolmogorov complexity is to give a precise quantification of simplicity, in terms of shortest description length.

◦ Completeness. Solomonoff’s completeness result is supposed to rigorously show that the theory is valid. I promote Completeness to the third candidate main principle, with the broader denotation that “the theory works”, because this is sometimes taken as the only real justification of the theory.

I have depicted the three principles and the possible relations to the mathematical framework in Figure 2.1. A more complete (but less orderly) picture would include the same or similar relationships between the principles themselves. The next Subsections 2.2.2 (Completeness), 2.2.3 (Simplicity) and 2.2.4 (Universality) are devoted to a more detailed exposition of our three principles.


2.2.2 Completeness

Recall that Solomonoff's "completeness" of the universal predictor M refers to the formal results of Chapter 1, Subsection 1.4.2: convergence of the predictions of the universal predictor to the actual probabilities. It means that using M will almost always and very quickly lead to the best possible predictions.

2.2.2.1 Completeness as a Principle

It might seem odd to treat this completeness as a candidate core principle of the theory. I have included it here to address the more general conviction that it's enough for the theory to "work", to give appropriate results: in this case, accurate predictions. Naturally, this aspect is of crucial importance – sometimes it's taken as the only argument that matters. Solomonoff himself is very explicit about this:

“It should be noted that the only argument that need be considered for the use of these techniques in induction is their effectiveness in getting good probability values for future events. Whether their properties are in accord with our intuitions about induction is a peripheral issue. The main question is “Do they work?” As we shall see, they do work.” [168, p. 1]

So the fact that the theory works can serve as its sole justification. According to the strictest version of this way of looking at things, it's a pointless exercise to try to defend SP by means of other principles: it suffices to refer to the completeness result. To do justice to this prominent point of view, I posit Completeness here as a core principle in its own right.

Completeness of a method However, I will be quick to dismiss it again in the next Chapter. The reader will have noticed that an adherence to Completeness in said manner is very much related to adopting the method interpretation of SP. Certainly, the first thing that matters for a method is that it works properly. The dismantlement of Completeness in Section 3.1 of the next Chapter, aided by the more systematic perspective of the principle of Universality, then also involves, in fact builds on, the conclusive dismantlement of this method interpretation.

2.2.3 Simplicity

The principle that probably pops up most often in discussions of Solomonoff's theory is that of Simplicity.

2.2.3.1 Simplicity in Science

The preference for simplicity is a main heuristic in scientific practice and theory [3, 98]. Our familiarity with its use and obvious-looking virtues is also a reason why its exact form and application are often not made very precise; in many cases, recognizing the


"simplest" of a number of competing theories is a matter of immediate intuition, and its selection proceeds with only a nod to the principle of Occam's razor for justification. The standard reading of the razor principle informs us that "entities are not to be multiplied beyond necessity". It would probably be a disappointing exercise to measure the force of this admonition by the number of scientists that are now kept from multiplying entities in their theories beyond what they deem necessary. The principle, so stated, is either too vague or too obvious to tell us anything about what simplicity is, and why we should look for it. If we need to compare hypotheses as to their simplicity, and we want to do this in a transparent and objective way, we would need some measure of simplicity. We would need a definition of one hypothesis being simpler than another. This forms the basis for a justification of our preference for simplicity.

Definition and justification of simplicity We can distinguish two broad types of simplicity. First, there is syntactic simplicity, or elegance, of a hypothesis: the conciseness or clarity of the hypothesis itself, say when written down. Second, there is the ontological simplicity, or parsimony: the number of things that are postulated, and their complexity. In justifying why we should aim for simplicity in the first place, there are some further useful distinctions to be made. We may look at the simplicity principle as a methodological principle, which says that it's sensible, in practice, to (provisionally) prefer the simpler of two valid hypotheses to work with. One reason could be that we adopt simplicity as a goal in itself, which needs no further justification: a more elegant or parsimonious hypothesis is, other things being equal, obviously more desirable to work with. Another reason could be that we believe that the simpler hypothesis is more likely to pay off: it's more likely to generalize or predict better. A stronger position than the methodological one is to take simplicity as an epistemic principle: it's rational to believe in the simpler hypothesis. In that view, simplicity serves as a pointer to the truth. (Note that these readings of Occam's razor are still different from a metaphysical position that the world must be simple.)

Whatever its precise form and motivation, I will refer to the adoption of a preference for simplicity as the principle of Simplicity.

2.2.3.2 Simplicity of Descriptions

Right from the beginning, Solomonoff stressed the connection of his theory to the principle of Simplicity (also under the label of Occam's razor), in the form of a preference for shorter descriptions: "That these kinds of models might be valid is suggested by 'Occam's razor,' one interpretation of which is that the more 'simple' or 'economical' of several hypotheses is the more likely. Turing machines are then used to explicate the concepts of 'simplicity' or 'economy'—the most 'simple' hypothesis being that with the shortest 'description.' " [163, p. 3] This evocation of Occam's razor was taken over by virtually all authors that later wrote about Solomonoff's theory (also see [42, 176, 67], and the quote in the introduction to this Chapter):


"Occam's razor is equivalent to choosing the shortest program that produces a given string." [32, p. 161]

"(...) we arrive at a predictor that favors simple descriptions in essentially the same way as advocated by Occam's razor." [140, p. 55]

The short descriptions of algorithmic probability The short or simple descriptions that are mentioned in the previous quotes refer to the minimal descriptions in the definition

$$Q(\sigma) := \sum_{\tau \in T_\sigma} 2^{-|\tau|}$$

of algorithmic probability (repeated here for convenience, see (1.2) on page 29). Clearly, the shorter the minimal descriptions τ of given string σ, the greater the algorithmic probability of σ. In the experiment of presenting random noise to universal U, the probability of arriving at a shorter description is naturally higher; and so is the probability of consequently obtaining a string with a shorter description.

The bias towards simplicity The assignment of higher probability to strings with shorter descriptions is one step. Identifying shortness of descriptions with simplicity is the next. Taken together, the simplicity bias of algorithmic probability follows. Uncontestably, strings with short descriptions are in a precise sense quite simple. Strings with very short descriptions can be said to be highly compressible: their actual information content is low; they are simple. On this reading, the shortest descriptions via a universal machine are not just about syntactical simplicity – this description length truly reflects the intrinsic information content of a string. If there are short descriptions of σ, then there is some algorithm that can detect patterns in it, rendering σ relatively simple from the corresponding machine’s perspective; if there are none, no machine can reduce the complexity of σ, which must mean it is truly complex. The shortest description lengths give a measure of all structure that is traceable in an effective way, which, if we take a string to represent a hypothesis, must even include such things as its parsimony, hence ontological simplicity. It seems that, in the universe of binary sequences, we are on the track of a genuine definition of simplicity. (This account should look familiar from the motivation of Kolmogorov complexity on page 24. The subtle yet important difference is that Kolmogorov complexity only considers the length of the one shortest description – I will come back to this point later.) All in all, the description lengths of a string appear to provide a convincing measure of its simplicity, and so algorithmic probability would show a clear preference for simple strings.
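A small numeric illustration, with made-up description lengths: if a string σ has minimal descriptions of lengths 5, 12 and 20, then

$$Q(\sigma) = 2^{-5} + 2^{-12} + 2^{-20} \approx 0.0315 \approx 2^{-5},$$

so the value is dominated by the single shortest description, the contributions of longer descriptions falling off exponentially.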

2.2.3.3 The Outlook on Simplicity

This concludes our precursory discussion of simplicity and Occam's razor, and the reason why Simplicity is regularly associated with Solomonoff's theory. At first glance, the theory doesn't only point at a definition of simplicity in terms of minimal

description lengths, but also implements a direct preference for simplicity, be it as a starting assumption or as a consequence of more fundamental considerations. In both cases, the promise of the theory is an honest justification of a preference for simplicity; indeed, since this preference provably leads the universal predictor to the assumed actual, true, distribution, a justification of the epistemic kind: an exciting prospect. In Section 3.3 of Chapter 3, I will review these points in more detail. I will elaborate on the exact guise of simplicity in the definition of algorithmic probability, which includes a discussion of the relation between Kolmogorov complexity and algorithmic probability as two related yet distinct measures of string complexity. In addition, I will examine the role of simplicity in the equivalent definition of the universal mixture distribution.

2.2.4 Universality

The content of the final candidate core principle, the principle of Universality, is perhaps not immediately as clear as that of the other two. We've already encountered the adjective "universal" in more than one context (specifically, the notions of universal machine, universal semimeasure, and universal prior distribution), but what is it supposed to mean as a principle?

2.2.4.1 All Things Considered

The notion of universal element in the class of semicomputable semimeasures (as given by Definition 1.4, page 33) refers to an element that dominates every other element in its class, that assigns probability values that exceed those assigned by all other elements (up to a constant, of course). Similarly, the notion of universal machine (Chapter 1, Subsection 1.1.3) refers to a machine that can emulate all elements of the class of Turing machines. The gist of the meaning of the technical notion of universality is that of being able to simulate everything (relative to a given class). Universality as I wish to take it here has a somewhat more general denotation still. The way I will use it, the intuitive meaning of the principle of Universality is that everything is taken into consideration. In Solomonoff's theory of Prediction,

◦ every conceivable possibility finds a place in the formal framework, and

◦ no possibility is disregarded by the universal predictor.

2.2.4.2 Universality of the Setting

The first item requires a framework that is sufficiently unrestrictive to accommodate all conceivable possibilities. Specifically, a model that allows us to capture, in principle and with a due amount of abstraction, every predictive problem; and that for each such problem includes every possible hypothesis, i.e., every possible prediction. The generality of the model boils down to

◦ the generality of the information-theoretic language of binary strings, and


◦ the generality of the probabilistic setting of semimeasures on binary strings, and

◦ the generality of the effective setting of semicomputable semimeasures on binary strings.

That is, the generality of our formal environments as representations of real-world environments. I consider the starting points of a probabilistic and effective setting in an information-theoretic language as instances of Universality, because they serve to erect a framework that is maximally unrestrictive, maximally general – a universal model.

2.2.4.3 Universality of the Predictor

Solomonoff's theory also adheres to the principle of Universality because the predictor M is universal. This predictor can be seen to take each and every possibility into account, and, moreover, to keep all options open. Formally, it dominates every other predictor, giving rise to the important completeness properties.

All programs, and all semimeasures Definition 1.2, page 29, of algorithmic probability Q is based on a universal machine, which represents all machines, and on all minimal descriptions for a given string via that machine. Thus all programs that could generate the string – intuitively: all its possible causes, all its possible explanations – are considered. As discussed in Chapter 1, Section 1.3, the aim of universality specifically drives the equivalent Definition 1.5, page 33, of the universal mixture distribution ξ. It gives a weighted sum of every single semicomputable semimeasure – again, intuitively: every possible cause, every possible explanation.

All possible extensions Furthermore, the probabilistic setting enables predictions that keep all options open. The domination property of the universal predictor ensures that if there is some measure that attributes a nonzero probability to a specific future (and there always will be), then the universal predictor will assign it a nonzero probability too. Hence every future will receive a (possibly vanishing) share in the total probability; no future is definitively ruled out.
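In symbols (with ν ranging over the semimeasures in the class and w the weight function), the domination property of the universal mixture reads

$$\xi_w(\sigma) \;\ge\; w(\nu)\,\nu(\sigma) \qquad \text{for every } \nu \in \mathcal{M} \text{ and every } \sigma,$$

and M inherits it up to a multiplicative constant; so every continuation that receives nonzero probability from some element of the class also receives nonzero probability from the universal predictor.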

Epicurus' stamp of approval Recall from Chapter 1, Subsubsection 1.2.1.2 that Li and Vitányi evoke as support for Solomonoff's theory Epicurus' principle of multiple explanations, which says that all hypotheses consistent with the data should be retained. (This is taken over by Hutter.) Both of the above properties, the consideration of every explanation (program, measure) and the preservation of every hypothesis (continuation), can be said to agree with the gist of this principle. Even so, it seems a bit opportunistic to take out this single instruction from the rich context of Epicurus' philosophical system2 and, via the above agreement, employ it as testimony for the validity of Solomonoff's theory. (Not to mention their treating in

2See, for instance, [2] for Epicurus’ epistemology and the precise role of the principle of multiple explanations in this context.


the same breath Epicurus' principle and the historically and philosophically quite unrelated principle of indifference.) It seems more sound to say that this heuristic directive is satisfied, in Solomonoff's setting, simply as a consequence of the much more fundamental principle of Universality.

Figure 2.2: Universality (everything taken into consideration), divided into the universality of the setting (the information-theoretic, probabilistic and effective setting; the class M of semicomputable semimeasures on bit strings) and the universality of the predictor (every hypothesis considered; (almost) always convergence to the true distribution; the universal elements M), both grounded in the mathematical framework.

Universality as domination I mentioned the previous interpretations of the functioning of the universal predictor because they are illustrative of its universality. Formally, it really all comes down to the single property of domination of every other effective semimeasure, every other predictor. The fact that the universal predictor's assigned probabilities at least match those of any other effective semimeasure, within a constant bound, will lead it very quickly to predict just as well as any other predictor – as codified in the completeness results.

2.2.4.4 The Outlook on Universality

This subsection served to give a first impression of the bearing of Universality. I will discuss the universality of the predictor and the setting in much more detail in the next Chapters 3 and 4, respectively. The bottom line is that in constructing a framework that is universal in the sense of keeping near full generality, incorporating only restrictions that are sufficiently mild

to hardly touch its inclusiveness but sufficiently substantive to allow us to infer nontrivial conclusions, we are in fact led to a significant result: the existence of universal prediction methods. From Universality, we get Universality. According to this interpretation, the whole theory can be said to be an instantiation or consequence of the principle of Universality, which therefore should be seen as the one and only core principle of Solomonoff's theory of Prediction.

3 Universality of Solomonoff Prediction

We have inherited from the previous Chapter three candidate core principles of Solomonoff's theory of Prediction. In this Chapter, I will further reduce this number. In a more in-depth analysis of the other two candidate principles, I will expand on my proposed interpretation of SP, and highlight the fundamental role of the principle of Universality. First I take a closer look at the principle of Completeness, show why it's best seen as just an aspect of Universality, and conclusively discard the method interpretation of SP in the process (Section 3.1). Next I will discuss a supposed main threat to Universality, namely the element of arbitrariness introduced by the choice of a particular universal predictor. This discussion, too, ultimately turns into a demonstration of the prominence of Universality in SP (Section 3.2). After that I examine in depth the role of the principle of Simplicity, and conclude that this principle should also be seen as subsidiary to Universality (Section 3.3). This Chapter revolves around the Universality of the universal predictor, and presupposes the generality of the framework that it operates in. In the next Chapter 4, I turn to a defence of the Universality of this setting, of the model of SP.

3.1 The Encapsulation of Completeness

I introduced Completeness as the third and final candidate core principle. Now I will retract it again, merging the completeness results into Universality.

3.1.1 Making the Method Work

The completeness Theorem 1.9 is an essential result of Solomonoff Prediction. It shows that the single universal predictor will almost always converge on optimal predictions, no matter what the actual measure µ is, and it shows that this convergence is extremely rapid. Certainly, without this result the theory would lose much of its interest. But is this sufficient to show that the theory "works", as the principle of Completeness has it? Can the completeness result bear the burden of an insightful justification of the theory, and as such reasonably figure as a proper principle? That this formal property makes the theory work is an interpretation that is attached to what is, in the first place, a technical result within the mathematical framework. We could start undermining this interpretation by pointing at the loopholes in the

convergence results. In particular, recall from Corollary 1.10, page 42, that it's only with µ-probability 1 that the probability values predicted by M converge to the actual values of µ. This still leaves a class of measure 0 of µ-generated sequences on which M doesn't converge. In fact, Hutter and Lattimore [110] have shown during the writing of this thesis that for all universal mixtures ξ_w there are Martin-Löf random sequences on which it fails to converge. The class of Martin-Löf random sequences can be seen to contain the most typical sequences generated by the uniform distribution µ = λ with λ(σ) = 2^{−|σ|}; the fact that M is unable to recognize some of these sequences is not an insignificant limitation. A further thing we could ask is whether we should be conclusively satisfied with just the completeness properties of the universal predictor's pointing the way towards accurate probability values – when it's perfectly possible to also take into account, for instance, matters of efficiency (like the number of computation steps in calculating these values) in the framework. I won't go into these particular worries. I would say that completeness as definitively showing the theory to work is too strong a reading, but for a much more fundamental reason. What the completeness result asserts, at best, is really only that the specific method of the universal predictor works. As such, it is associated with the interpretation of SP as a method of prediction (Chapter 2, Subsection 2.1.2). An argument that Completeness is not a fundamental principle of Solomonoff's theory should therefore start with a rejection of this method interpretation of SP.

3.1.1.1 Against Method

We saw in Chapter 2, Subsection 2.1.2 that there is no hope of applying or approximating the universal method of SP in a practical way. Its virtue must be a different one – one that is barred by the method interpretation, as I will begin to argue.

Only a method My motivation for dismissing the method interpretation of SP is superficially connected to an important critique of Bayesian confirmation theory by K.T. Kelly and C. Glymour [96]. They attack Bayesian confirmation theory on precisely the point that it gives nothing but a method of inference. In their view, the theory comes down to just one strategy for finding the "truth" (or more generally attaining a consistent theory, or accurate predictions1). Therefore the theory is not fit to reach for what they take to be the main goal of a philosophical theory of scientific inference: providing an explication of scientific justification. Contrary to received wisdom, Bayesian inference theory is not even the right sort of thing. Their specific argument is couched in the requirements that a justification of scientific inference should accommodate, first, the intrinsic difficulty of inductive problems, and, second, the relative efficiency of the method used. Bayesian confirmation theory, so they claim, achieves neither: the probabilities of confirmation theory – probabilities that hint at actual, permanent, results, but that in reality can keep on fluctuating –

1In their own words: “The reference to truth (...) should not be taken too seriously. (...) let “truth” range over all such finite-evidence transcending cognitive goals” [96, p. 98].

give no clue about the method's performance on any specific problem in an absolute sense, nor in comparison with other methods. One way of replying to Kelly and Glymour is to question the specific requirements on a theory of scientific justification that they postulate. I don't want to expose ourselves to such objections: the lesson that I wish to distill from the argument is the overarching point that a specific strategy of inference, by itself, gives us little with respect to the aims of a (philosophical) theory of inference, that is, with respect to a clarification of the related conceptual issues. In fact, from that more general perspective, the first point above doesn't seem imperative: surely it would be too strict to exclude the possibility of philosophically valuable theories of inference that don't provide a measure of difficulty of inductive problems. It's only one example of an aspect that such a theory could elucidate.2 The second point looks more fundamental. It reflects the idea that the sole act of advancing a certain inference strategy is philosophically not very interesting, for the specific reason that the result would be just one among other possible strategies, which might be just as good or even better. But in the case of SP, doesn't the completeness result address just this concern, in showing that the universal prediction method is truly optimal? It does show a specific optimality – under the assumption that the true environment is given by an effective measure. So we could start by noting that we need to pass beyond the method because a full demonstration of the optimality of the method should also include a justification of the generality of the model. But let's say that we unreservedly accept completeness as showing the universal predictor to be optimal: then we still haven't escaped the gist of the above point. An optimal method in a specific sense doesn't exclude alternative methods that are just as good in the same or some other sense. Again, however, I want to take this observation as just one illustration of the more general point: namely, that a proposed method, even if it provably has desirable technical properties (like those expressed by completeness), simply needs more context to have any bearing on the broader conceptual matters. Even the specification of an optimal method of prediction is by itself not sufficient to arrive at any philosophical conclusions.

Competition: formal learning theory As an alternative Kelly and his collaborators promote formal learning theory (FLT) ([134, 63, 97], also called logical reliability theory [93] or means-ends epistemology [151]). In FLT, the problem of induction (or prediction) is subdued not by weakening the notion of truth (as confirmation theory does by replacing it with the, in Kelly and Glymour’s view, questionable notion of confirmation), but by giving up the requirement that inductive procedures should halt or “ring a bell” when they succeed. The best possible way to converge to the truth in an inductive problem may, due to the typical intrinsic difficulty of the problem, only lead to assured success in the limit; and such a truth-finding strategy can be said to be justified if it solves the problem in this way in the best possible sense, that is, with the least maximal number of retractions of conclusions or “mind-changes”. The analogy to formally unsolvable, or incomputable, problems is a powerful heuristic in this approach, and just like

2From this more general perspective one might also criticize Kelly and Glymour on their not taking sufficient account of the possibility of a broader theory conception of Bayesian confirmation (cf. Chapter 2, Subsubsection 2.1.1.1).

the familiar classification of formal problems by their computational complexity, we can build a hierarchy of empirical problems based on their intrinsic complexity, a complexity that is characterized topologically (according to Kelly [94], topology can best be thought of as the mathematical theory of ideal verifiability) by the best possible ways in which they can be solved. It’s interesting to note that while SP and FLT are radically opposing theories in many respects, they share the theory of computability as a point of departure.

A proper theory The brute fact of the specification of an (optimal) inference strategy does little to illuminate the related conceptual issues. But what else could be the purpose of an idealized strategy, that holds out no promise of ever being directly implemented or closely approximated? What merit could we find in such an ethereal strategy, other than an indirect one: to provide a better understanding of the related conceptual issues, or, much related, to inspire higher-level methodologies? Indeed, these are the aims of an idealized theory that we identified in Chapter 2, Subsubsection 2.1.4.2 – and it would only be in the context of such a theory, that puts the method in a broader perspective and thence promises a handle on the much more interesting matters of just how or why it is good, that these aims can be fulfilled. The conclusion must be that the theory of Solomonoff Prediction should be treated as just that, a theory – not as a stand-alone method. If SP is to have a wider relevance to the challenges surrounding scientific inference, this is the way to look at it. One aim of this entire Chapter is to show that we can look at it in this way.

3.1.2 Making the Theory Work

The dismissal of the method interpretation brings with it the rejection of Completeness as a principle.

3.1.2.1 Making an Idealized Theory Work

The principle of Completeness expresses that the theory works because of completeness. But while it certainly is an essential component of the theory, the result of completeness only states, on a charitable reading, that the universal predictor works. Considering the aims of an idealized theory, the requirements on a “working” idealized theory should be much broader: it might be said to “work” just in case it can fulfill these aims. The completeness result is conceptually way too limited to bear this burden. It doesn’t make the theory work. Since the purpose of the core principles was precisely to further the aims of the idealized theory, the completeness result shouldn’t be considered a core principle. But, if anything, the principle of Universality would.

3.1.2.2 An Aspect of Universality

The completeness property of the universal predictor is a direct formal consequence of its property of domination, of formal universality. Conversely, if a desirable property can be shown to arise from a starting assumption, the former can serve as a justification for the latter. In that respect, the completeness result serves as a justification of the significance of the formal property of universality. Thus formal universality and completeness are two sides of the same coin. Indeed, completeness, rapid convergence to any actual measure, can rightly be seen as an additional universality property of the universal predictor. Formal universality is a clear instance of the principle of Universality, as part of a setting that is motivated by Universality and as the formal property of predictors that adhere to Universality. Therefore the dual property of completeness is an instance of Universality too. Even if we interpret the completeness result as an independent consequence and justification of Universality, and I think this is a valid way to look at it, completeness certainly can’t be said to be a principle itself, and so it should be treated as just an aspect of this much more fundamental principle of Universality.

[Figure 3.1: Completeness and Universality. A schematic of the mathematical framework under the heading “Universality: everything taken into consideration”: the universality of the setting (the information-theoretic, probabilistic and effective setting; the class M of semicomputable semimeasures on bit strings) and the universality of the predictor (formal universality: dominance of the universal elements of M), which implies completeness: convergence to the true distribution, (almost) always, with every hypothesis considered.]

3.2 The Threat of Subjectivity

There is one important problem connected to the principle of Universality that merits our attention at this point. In response to this problem I will lay down more clearly what I think is the essence of SP: the existence of universal predictors in a universal

setting.

3.2.1 The Threat

In most of this Chapter, I conveniently talked about the predictor M – as I said I would do at the end of Chapter 1, Section 1.3.2. But, of course, this M is not uniquely defined. There are infinitely many ways of defining a universal semicomputable semimeasure along the lines of Definition 1.5 – one for each bounded weight function. Analogously, there are infinitely many ways of defining algorithmic probability as in Definition 1.2 – one for each universal monotone machine. It’s customary to bring in the Invariance Theorem 1.3 (page 30) as proof that this freedom is nothing to worry about: all possible choices are formally equivalent. The difference in probability values given by any two universal predictors is strictly within the bounds of their unique multiplicative constants. Still, these constants could be gigantic. While this becomes increasingly inconsequential for larger sequences, for short sequences it could make all the difference. Even the Invariance Theorem can’t confute the impression that the unmotivated stipulation of a particular reference weight function or universal machine brings in a strong element of arbitrariness. Accordingly, it has been recognized by the proponents of Solomonoff’s theory as one of its foremost challenges (cf. [84]).
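To spell out what these bounds amount to (schematically, and without reproducing the exact statement of Theorem 1.3; the machines U, U' and the constant c here are just labels for the statement): for any two universal predictors Q_U and Q_{U'}, defined from universal machines U and U', there is a constant c, depending on U and U' but not on the string, such that for all σ

2^{-c} · Q_{U'}(σ) ≤ Q_U(σ) ≤ 2^{c} · Q_{U'}(σ).

Nothing in the theorem bounds the size of c, so for strings that are short compared to c the two predictors may order the possible continuations quite differently.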

3.2.1.1 The Quest for a Unique Universal Predictor

Keeping to the definition of algorithmic probability for the moment, things would look better if it were possible to single out an objective or “natural” machine to define Q with. Some universal machines are obviously more natural than others: a machine that has a number of complex strings hard-coded (and consequently accepts very short descriptions for these complex strings) is strongly biased towards these particular strings and seems less general than machines that aren’t. The intuition is that natural machines should contain as little information as possible themselves. One would make this precise in terms of description length of a program that gives this machine – but of course, that requires another natural machine to interpret it, and we’re back where we came from.

Algorithmic probability of machines In a recent paper [125], M. Müller attempts to define a distribution on universal machines themselves. Using the fact that one can objectively quantify how hard it is for one machine to emulate another (namely, by the probability of arriving, on the first machine, at a description of the second), he exploits the intuition that more unnatural computers are harder to emulate. This he makes precise via a Markov chain of all universal machines, which are linked by their emulation probabilities; the hope is that this chain is positive recurrent, and so yields a stationary algorithmic probability on all machines. Given this distribution on machines, it would be straightforward to define an objective, “machine-independent”, version of algorithmic probability.


Unfortunately, this stationary distribution doesn’t exist. Attaching a physical interpretation to this negative result, Müller turns it into a plausibility argument that there is no way to get completely rid of machine dependence, not via any approach.

An optimal weight function If we approach the problem of subjectivity from the definition of the universal mixture distribution, the goal would be to find a single weight function that is more objective or natural than others. Arguments have been advanced that, indeed, some weight functions are preferable because they are optimal. Since the weight functions in question implement a bias towards simplicity, I will discuss this strategy in the next Section 3.3 – it suffices to say here that this strategy, too, doesn’t look too promising.

3.2.2 Taking the Threat Away

So it seems that we’ll have to accept that there may not be a way of agreeing on a unique truly objective machine or weight function. How to proceed in that case?

3.2.2.1 Acceptance

Hutter and Solomonoff represent two possible solutions: working around the problem of subjectivity, and embracing it. They can be seen to fit into a method and a model interpretation of SP, respectively, and must give way to the final solution that springs forth from viewing SP as a proper theory.

Working around subjectivity: the method The choice of a particular universal predictor only really matters for prediction of relatively short sequences, because all such predictors will eventually converge to the actual distribution. Hence, Hutter observes, the problem could be circumvented by simply conditionalizing on a very large “prior” bit sequence:

“It is worth noting that this problem of arbitrary predictions for short sequences x is largely mitigated if we use the method of prefixing x with prior knowledge y. The more prior knowledge y that is encoded, the more effective this method becomes. Taken to the extreme we could let y represent all prior (scientific) knowledge, which is possibly all relevant knowledge. This means that for any x the string yx will be long and therefore prediction will be mostly unaffected by the choice of universal reference machine.” [140, p. 68]

While this is undeniably true, one can’t help feeling a bit uncomfortable with such a solution. For what does it solve? What it gives, basically, is a rather improvised way of adjusting a method that, as an idealized method, we can never employ in this form anyway. I don’t think speculations along these lines are very helpful: they are a clear consequence of sticking to the method interpretation of SP, which was dismissed in Subsection 3.1.1.


Embracing subjectivity: the model Solomonoff has not only come to accept subjectivity, but has positively embraced it, stating that it is “not a Bug in the system” but a “Necessary Feature” [170, p. 600]. In essence, his argument is that one always needs a priori information to perform sensible prediction. And the selection of a particular universal machine, which is really the selection of a particular computer language to use (one that includes the general definitions that express our basic knowledge), hands us the opportunity to insert this a priori information – indeed, it’s the most general way of doing so. [168, s. 2] In the context of artificial intelligence research, Solomonoff himself took very much a method interpretation of his theory, as a method of learning. This is witnessed, too, by his talk about a system of induction. But what he does in this argument is of a broader and more speculative nature. He takes the machines to represent the a priori knowledge that is always present in a learning situation: this solution to the problem of subjectivity is an additional claim about what is modeled by his theory, what is captured in the model. This strategy thus amounts to an additional assumption, in the form of an extension of the model. Its merit is that, on accepting it, it directly hands us an answer why the element of subjectivity is unavoidable: there must be room for prior knowledge. Of course, the problem is to give an independent rationale for adopting this supposition in the first place. Any additional assumption poses a potential threat to the generality of the model, in its ability to veraciously reflect the real world. But even if we find it harmless enough in this case (perhaps on the grounds that it only concerns the ethereal universal predictors), further arguments are required for the inevitability of what is undeniably a rather speculative interpretation of the difference between the various universal elements. Otherwise, this solution doesn’t amount to much more than a recognition of the unavoidability of subjectivity and just one non-binding possibility of how to make sense of it.

Beyond subjectivity: the theory The previous two solutions weren’t very satisfying because they took too narrow a view of the theory. Going beyond the levels of method and model, and taking a renewed look at the role of Universality in connection with the aims of the idealized theory, will also take us beyond the problem of subjectivity.

3.2.2.2 The Class of Universal Predictors

My proposal is that the content of Solomonoff’s theory of Prediction consists in the result that in a universal framework, there exist universal predictors. In the information-theoretic, probabilistic and effective framework, there exist predictors that take everything into consideration, and that eventually predict as well as any other. It doesn’t matter that there is an entire class UM of such predictors. It doesn’t matter that among them, none can be said to have a privileged status. We won’t ever actually have to choose one such predictor, let alone use it. The important thing is that they can be shown to exist. Indeed, if no member of this class of predictors has a privileged status, this is only justification for treating the whole class as one predictor: the universal prior

distribution M, the universal predictor M. I’m not saying that all universal predictors are equal: in spite of the Invariance Theorem, they’re not. But they all share the same property of universality, and for the purposes of the idealized theory, the existence of predictors with this property is the relevant thing. As soon as we accept this, the threat of subjectivity evaporates. It’s simply not relevant. Insisting on the selection of one particular predictor presupposes that we would actually go and employ it for prediction. We wouldn’t because we can’t: SP doesn’t give practically applicable methods. It makes no sense to descend below the level of the class of universal predictors. The prime thing we get out of the theory is that this class exists.

One more cheer for Universality The existence of a universal predictor in a universal setting is the result that defines the theory, and that may further its aims as an idealized theory. This is the result that can, perhaps, provide an entrance to the philosophical problem of prediction. This is also the result that can, hopefully, inspire higher-level theories. And the principle that constitutes this result is the principle of Universality.

3.3 The Questionable Role of Simplicity

I hope to have made plausible by now that the fundamental starting point of Solomonoff Prediction is the principle of Universality. There is yet one major obstacle to declaring Universality the sole principle of SP: the principle of Simplicity. In their emphasis on the role of simplicity in Solomonoff’s theory, most authors have us believe that Universality must at least share the honour with Simplicity. Certainly, even if I’m granted that the main import of the theory is the existence of a class of universal predictors in a universal setting, following only from the principle of Universality, it’s still highly interesting, for both of the aims of the idealized theory, to see if the predictors of this class have other characteristics in common. A bias towards simplicity, for instance.

Recall that the elements of the class UM of universal predictors can be defined both as algorithmic probability distributions and as universal mixture distributions. In Subsection 3.3.1 I will take a closer look at the role of simplicity in the definition of algorithmic probability, and in Subsection 3.3.2 I will shift attention to the definition of the universal mixture distribution. I conclude in Subsection 3.3.3.

3.3.1 The Short Descriptions of Algorithmic Probability

Consider again the definition (1.2), page 29, of algorithmic probability. As discussed in Chapter 2, Section 2.2.3, it seems to reflect an unambiguous bias towards shorter descriptions, and so simpler strings. Let us now consider more closely the exact guise of simplicity in this definition.


3.3.1.1 The Definition of Algorithmic Probability

The link of simplicity with algorithmic probability is established in a two-step process. First, there is the association, inspired by information theory, of probability with description length. Second, there is the identification of description lengths and simplicity.

Simplicity from algorithmic probability Does the bias towards simplicity figure as an assumption or as a consequence of the definition of algorithmic probability? This depends on whether or not there is more to the first step than the blunt stipulation that shorter is more probable. As a way of denying this, in the hope of isolating a more fundamental starting point, one could argue that this assumption is not visible in the set-up of the bit-feeding experiment that defines algorithmic probability. Rather, the experiment starts from Universality: it’s a universal machine that’s in the centre. This scenario has a certain intuitive appeal, but one could still question its status as a starting point by inquiring about the reason for specifically feeding the machine random noise. Another, only slightly different way of looking at the experiment is that the algorithmic probability distribution results from a transformation, through a universal monotone machine, of the uniform distribution λ(σ) = 2^{-|σ|}. Then what is the justification for the choice of this distribution? Since the choice for the uniform distribution comes down to the decision to hand out equal probability to all descriptions of the same length, this is where one might appeal to the principle of indifference (Chapter 1, Subsubsection 1.2.1.2). Now the question is what this appeal amounts to. Do we mean to presuppose the validity of the principle as formulated by Laplace or Keynes, and then ground the validity of our choice for uniform λ in the bit string setting as a special case of the general principle? In that case, the credibility of this move is exposed to the full range of serious problems concerning the principle of indifference (page 23). Alternatively, we formulate a more specific version of the principle that applies in our setting (a quite specific version, which partitions the class of all strings with respect to their lengths before evenly distributing probability). But clearly we don’t gain anything by assuming the validity of a principle that asserts just what we set out to justify. Hence the evocation of the principle of indifference doesn’t hand us a more fundamental starting point, and we are still stuck with the task of justifying the choice for the uniform distribution. Why start here – isn’t it odd to presuppose a uniform prior distribution, on the silent ground that this is the unique most objective one, towards defining a universally objective prior distribution? In the end, description length is the only thing that determines a string’s uniform probability. It seems there is little we can do but to take the explicit association of probability with description lengths as the starting point. In which case simplicity is an assumption in the definition – at least, if we are prepared to equate simplicity with description lengths.
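For concreteness, the transformation just described can be written out as follows (a schematic rendering; the precise minimality condition on descriptions is that of Definition 1.2, which I do not repeat here):

Q(σ) = Σ_{τ minimal : U(τ) ⪰ σ} λ(τ) = Σ_{τ minimal : U(τ) ⪰ σ} 2^{-|τ|},

so a string’s algorithmic probability is the uniform probability of the set of input streams on which the universal monotone machine U produces an extension of σ.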

Simplicity in algorithmic probability There is one potential downside, though, of identifying simplicity with description lengths in the manner of algorithmic probability. This has to do with the possibility that a string with only long minimal descriptions still receives a high probability because of the sheer number of its minimal descriptions. Even though shorter strings are arrived at earlier in the experiment setting, longer descriptions are certainly not neglected. A string that only has very lengthy descriptions (and so should be considered quite complex, following the intuition that shortness of description signifies simplicity), but a great many of them, will still be assigned a relatively high algorithmic probability – or so it seems. After all, it might very well be the case that any string that has sufficiently many long descriptions to make this example work must also have a short description. This issue boils down to the question whether algorithmic probability is (approximately) equal to the probability given by the single shortest description only. The measure of complexity by shortest description length is a familiar one: this is the Kolmogorov complexity.
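A bit of arithmetic (with made-up numbers of my own) shows what is at stake here: if a string σ had 2^k distinct minimal descriptions, each of length ℓ, then these alone would contribute

2^k · 2^{-ℓ} = 2^{-(ℓ-k)}

to Q(σ), exactly as much as one single description of length ℓ − k would. Sheer quantity of long descriptions can thus, on the face of it, mimic the effect of one short description; whether this situation can actually arise without a genuinely short description also being present is precisely the question just raised.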

3.3.1.2 The Very Shortest Description

The idea of equating simplicity with compressibility, that was first formalized in Solomonoff’s definition of algorithmic probability, finds an obvious sequel in the notion of Kolmogorov complexity (see page 24). Kolmogorov complexity is generally interpreted as providing a universally objective definition of the complexity, ergo simplicity, of binary strings. To repeat, whereas algorithmic probability takes into account all (minimal) descriptions, the Kolmogorov complexity of a string is defined as the length of its single shortest description. It has been argued, nonetheless, that Kolmogorov’s definition is roughly the same as a definition that takes into account all descriptions. This is interesting, because it would further support the claim that if we accept Kolmogorov complexity as the appropriate measure of simplicity, then algorithmic probability, implementing an equivalent measure, truly incorporates a fundamental preference for simplicity.

Kolmogorov complexity as an objective measure of simplicity But why accept Kolmogorov complexity as the proper objective measure of simplicity in the first place? Writings in the fields of mathematics and theoretical computer science that treat of the subject tend to take for granted that the notion is important, exactly for the reason that it is supposed to give an intuitively convincing definition of complexity of objects. Disappointingly, there is hardly any critical discussion of this issue in the philosophical literature. A critique of Kolmogorov complexity as a measure of information content is given by P. Raatikainen in [137], but he grants that Kolmogorov complexity as an explication of complexity “agrees quite well with common sense”. The only thing he questions is the strategy of specifying an object by means of algorithms, noting that “it is one thing to specify an object, and another thing to give instructions sufficient for finding one – not to mention for constructing one” [137, p. 13]. Even if we grant that shortest description length is an authoritative measure of simplicity, then the objectivity of the notion of Kolmogorov complexity is still jeopardized by the familiar subjectivity involved in the choice of a particular universal machine.


Again, the bounds of the Invariance Theorem of page 25 allow for arbitrarily large constant differences. To get a better picture of how much room the definition leaves us, suppose that we think that the measure K_U specified via universal U accurately reflects the complexity of binary strings. Now for any number n there exists a universal machine U' that for the i-th longest U-description of less than n bits returns the i-th most complex string τ with K_U(τ) > n. The result is that the strings σ that are most simple according to K_U (specifically, K_U(σ) < n) have at least complexity K_{U'}(σ) > n according to U', while a similar number of strings τ that are complex to U (K_U(τ) > n) are the most simple to U': up to an arbitrarily large point, one can turn complex into simple and simple into complex. Note that the incomputability of Kolmogorov complexity makes it impossible to effectively construct such a machine from a description of U, but, for the simple reason that only finite information is required, we can be sure that U' always exists (simply imagine that all relevant strings are hard-coded in this machine). Of course, this is just a rephrasing of what is already given by the bounds of the Invariance Theorem. It’s an explication of what we can do within a constant bound. One can probably come up with more imaginative examples of this kind, but the constant bounds impose a strict limit on what we can do. It wouldn’t be possible to indefinitely repeat a procedure of swapping short and long: even if some such procedure were effective (which means it would have to involve approximations to complexities), the corresponding universal machine would no longer be additively optimal. One could, moreover, object that the additively optimal machines we create in such examples are no longer “natural”. But again there doesn’t appear to be a natural way of isolating a subclass of such machines to come to a more restricted definition of Kolmogorov complexity. And the freedom that the original definition leaves us to derive intuitively very diverging measures must cast at least some doubt on the status of Kolmogorov complexity as a completely satisfying explication of simplicity.

Kolmogorov complexity and algorithmic randomness Some independent support for the objectivity of Kolmogorov complexity might be drawn from the field of algorithmic randomness, a branch of algorithmic information theory that has found a revival in the last decade (see the textbooks [127, 46]; also see page 91). Kolmogorov complexity provided a way to make the intuition precise that random sequences should be highly incompressible: a sequence is random if all its initial segments have maximal Kolmogorov complexity. A relevant fact, codified in Schnorr’s Theorem [150], is that the main alternative formalizations, via the theory of Martin-Löf and the theory of martingales, isolate the same class of random sequences. From three intuitively very different angles, respectively highlighting the properties of incompressibility, statistical regularity and unpredictability, we obtain the exact same characterization of randomness – a situation reminiscent of the way the Church-Turing Thesis is supported by the equivalence of many different approaches. Indeed, the supposition that the Martin-Löf random sequences are exactly the intuitively random ones is called the Martin-Löf-Chaitin Thesis (MLCT) in [43] and defended by analogy to the Church-Turing Thesis. However, the former thesis appears less solid than the CTT. The support from confluence of the CTT rests on a great number of formalisms; in the case of the MLCT there are only three. Moreover, there are alternatives to Martin-Löf’s characterization of randomness that have nice properties of their own. Martin-Löf randomness is based on semicomputable statistical tests, but one could opt for weaker or stronger versions: computable tests isolate the Schnorr random sequences, tests that can use the halting set as an oracle result in the 2-random sequences. Much the same holds for the choice of martingales. (See [127].) We are forced to the conclusion that this proposed defence of the objectivity of Kolmogorov complexity is far from conclusive.

Algorithmic probability and Kolmogorov complexity For the sake of discussion, let’s suppose that we embrace Kolmogorov complexity as the proper measure of simplicity of binary strings. The question is whether the definition of algorithmic probability is in some sense equivalent. Let me be more precise. On the one hand, there is the Kolmogorov complexity of a given string σ, based on its single shortest description. Recall that in the current setting we are working with monotone machines, which give rise to the monotone Kolmogorov complexity

Km(σ) = min{|τ| : U(τ) ⪰ σ}.

The corresponding probability distribution, directly linking description length to probability, would be defined as R(σ) := 2^{-Km(σ)}.

Note that Km(σ) = − log R(σ). On the other hand, there is the algorithmic probability Q. Again: the more complex, the more improbable; but now all descriptions are considered. If we translate this distribution into a measure of complexity, we arrive at the definition

KM(σ) := − log Q(σ).

Now the argument put forward by Hutter (see [81]) is that, informally,

Q(σ) ≈ R(σ)

(equivalently, Km(σ) ≈ KM(σ)). On a first reading, this would mean that a string’s algorithmic probability is a direct measure of its simplicity. It would also mean that if a string has a high algorithmic probability due to a great number of descriptions, it must have a very short description as well. Ultimately, it would give further credence to algorithmic probability as implementing a clear preference for simplicity. Hutter asserts that, since Km(σ) ≈ KM(σ), there is a clear bias towards simplicity in the definition of algorithmic probability. In his own words, “From M(x) ≈ 2^{-K(x)} we see that M assigns high probability to simple strings (Occam).” [81, p. 47] “This can be seen as both an intuitively appealing characteristic of M(x) and as a fundamental justification for Occam’s razor.” [140].3 But how approximate are the two?

3Hutter denotes Q by M.


The Coding Theorem in the discrete setting There is, in fact, a precise equivalence between what corresponds to Q and R in a somewhat simplified, discrete, setting. In the definition of (semi)measures on page 31, that are more properly called continuous (semi)measures, I set out from cylinder sets, and derived from those distributions over the corresponding strings themselves. Of course, we can skip this step, and directly define a function µ : 2^{<ω} → ℝ that attributes a real value to each finite string. While we want to keep to the natural restriction that Σ_σ µ(σ) ≤ 1, we would lose the property of subadditivity. In effect, we no longer care whether one string is an initial segment of another: the probability of the latter doesn’t have to stay below that of the former. These functions constitute the class of discrete semicomputable semimeasures. One can construct a universal element as a universal mixture distribution, denoted ξ_discr, in the same way as in the continuous setting. The discrete universal mixture distribution appears to be asymptotically equivalent to the version of algorithmic probability that is based on prefix machines:

Q_prefix(σ) = Σ_{τ : U(τ) = σ} 2^{-|τ|},

with U a reference universal prefix machine. Just like discrete semimeasures do away with the intuitively valuable property of subadditivity, prefix machines lack the “dynamism” of monotone machines (they can’t deal with potentially infinite strings; descriptions don’t involve extensions), making them intuitively less suitable. The probability distribution based on the single shortest prefix description, the length of which is the prefix Kolmogorov complexity K, is given by

R^K(σ) = 2^{-K(σ)}. The asymptotic equivalence of the discrete universal mixture and the discrete algorithmic probability again motivates treating the two as representing a single discrete universal prior distribution m. Now the full asymptotic equivalence of all these notions is given by the following central result.

Theorem 3.1 (Coding Theorem [115]).

m(σ) =× R^K(σ).

Equivalently, − log m(σ) =+ K(σ). So not only are the universal mixture ξ_discr and the prefix algorithmic probability Q_prefix equivalent (analogous to the continuous case): both the corresponding complexity measures are equivalent to the prefix Kolmogorov complexity. Only the shortest code counts; hence Li and Vitányi’s choice for the name Coding Theorem. The result is accompanied in their textbook by words that bring to memory the argument of confluence in support of the Church-Turing Thesis: “In mathematics the fact that quite different formalizations of concepts turn out to be equivalent is often interpreted as saying that the captured notion has an inherent relevance that transcends the realm of pure mathematical abstraction.” [119, p. 273]
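One half of Theorem 3.1 is in fact immediate from universality; a brief sketch (my paraphrase of the standard argument, not the proof in [115]): the function σ ↦ 2^{-K(σ)} is itself a discrete semicomputable semimeasure, since it is lower semicomputable and Σ_σ 2^{-K(σ)} ≤ 1 by Kraft’s Inequality. The universal mixture m therefore dominates it, that is,

m(σ) ≥ α · 2^{-K(σ)} for some constant α > 0, hence − log m(σ) ≤+ K(σ).

The substantial content of the Coding Theorem is the converse inequality K(σ) ≤+ − log m(σ): up to a constant, the single shortest code already accounts for the whole of m.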

A Coding Theorem in the continuous setting: dashed hope Levin [114] conjectured that an analogue to the Coding Theorem would also hold for the monotone Kolmogorov complexity: Km(σ) =+ KM(σ).


By definition (since the set of all descriptions includes the shortest one), it is the case that KM(σ) ≤ Km(σ). For the converse inequality, P. Gács [53] proved the upper bound Km(σ) ≤+ KM(σ) + K(|σ|). However, in the same paper he showed that there is also a strict lower bound. Namely, for infinitely many σ, it holds that Km(σ) > KM(σ) + A^{-1}(|σ|), where A^{-1} is a version of the inverse Ackermann function. Although this function is incredibly slow-growing, and so apparently doesn’t drive too much of a gap between the two, it still means that the difference can grow without bounds:

sup_σ |KM(σ) − Km(σ)| = ∞.

Very recently, A.R. Day ([39], also see [55]) inferred a much stronger lower bound. For any real constant c < 1, there are infinitely many σ such that

Km(σ) > KM(σ) + c log log |σ| (3.1)

These results entail that, contra Levin’s conjecture, the monotone Kolmogorov complexity and the negative logarithm of the algorithmic probability can be quite different.

Another blow Hutter [80, 82] himself added to this the sobering result that the conditional probabilities given by Q and R can grow farther apart still. They even differ to the extent that in nondeterministic environments the convergence and optimality results of Chapter 1, Section 1.4.2 no longer hold for prediction with R, making R a worse predictor. These outcomes together don’t bode well for the strategy of showing the preference for simplicity of algorithmic probability by arguing that it’s not different from Kolmogorov complexity.

3.3.1.3 Conclusion

Algorithmic probability starts from a clear identification of probability and description lengths. Less clear is the next step of equating description lengths with simplicity. Only on acceptance of this identification can we call simplicity an actual assumption of algorithmic probability. We saw that the strategy of showing algorithmic probability to be equivalent to the simplicity measure given by Kolmogorov complexity doesn’t work because of the failure of the Coding Theorem and the different predictive properties of the analogue definition in terms of the latter. Indeed, this last fact, that Q does give rise to the full completeness result and so has much better predictive properties than the related definition in terms of monotone Kolmogorov complexity (and the same goes for prefix Kolmogorov complexity, that results in even worse prediction [80, 81]), is a main reason why we should stick with this definition. But it also suggests that we have reason to actually consider algorithmic probability the better candidate as a specification of simplicity. If we can put aside the worry that taking account of all descriptions is perhaps intuitively less suitable than the shortest description only, then maybe

we should rather take algorithmic probability’s sum of all descriptions as the proper definition of simplicity of binary strings. After all, we also saw that Kolmogorov complexity as a definition of simplicity is not unproblematic anyway. However, algorithmic probability is very similar to Kolmogorov complexity in that it is vulnerable to the same examples of driving an intuitive wedge between simple and complex descriptions by constructing malicious universal machines. This again calls into question the credibility of stipulating an identification of simplicity and description lengths. Unfortunately, all of these observations must remain a bit inconclusive, as they revolve for a large part around the fuzzy matter of the intuitive extent of the freedom given by small or constant bounds – a theme that we take with us to the next Subsection.

3.3.2 The Weights of the Universal Mixture Distribution

So far, I have only looked at the definition of the algorithmic probability Q. What about the equivalent Definition 1.5 of the universal mixture distribution ξ as a weighted mean of all semicomputable semimeasures? This definition, too, has been related to a preference for simplicity4:

“Then Solomonoff’s inductive formula M(y|x) (...) can be viewed as a mathematical form of Occam’s razor (...).” [119, p. 358]

How so?

3.3.2.1 Kolmogorov Complexity as Weight Function

Recall that the definition of the universal mixture distribution made use of a lower semicomputable weight function w, that was only required to satisfy Σ_i w(µ_i) ≤ 1 and w(µ_i) > 0 for all i ∈ ℕ, where {µ_i}_i is a computable enumeration of all members of M. Now consider the weight function w(µ_i) = 2^{-K(µ_i)}. (I will come to the definition of the Kolmogorov complexity of an effective semimeasure in a moment.) This function is lower semicomputable because K is. Since any set of shortest prefix descriptions must be a prefix-free set, Kraft’s Inequality (Theorem 1.1, page 26) tells us that Σ_i 2^{-K(µ_i)} ≤ 1 also holds true. So nothing keeps us from defining

M(σ) := ξ_w(σ) = Σ_i 2^{-K(µ_i)} µ_i(σ). (3.2)

This is the route that Li and Vitányi take. It very much resembles the typical Bayesian strategy of defining a prior distribution that is explicitly aimed at giving high probability to simple hypotheses (e.g., [92]). Clearly, the bare possibility of this strategy falls short of showing that we are committed to an assumption of simplicity. It doesn’t exclude the existence of wildly different weight functions, that don’t reflect a

4Li and Vitányi employ “M” to refer to the universal mixture distribution with a specific weight function, as I will explain next.

bias towards simplicity. The only thing that it shows directly is that it’s possible to accommodate a simplicity bias in the theory (cf. [96]) – at least, if we accept Kolmogorov complexity as the proper measure of simplicity.
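To get a feel for the bias that (3.2) is meant to implement, here is a rough illustration of my own, with only indicative orders of magnitude: a simple semimeasure such as the uniform (Bernoulli-½) measure has some small, fixed complexity c and so receives weight 2^{-c}, whereas a semimeasure that hard-codes a specific algorithmically random parameter string of length ℓ has complexity of roughly ℓ at least, and so receives weight of the order of 2^{-ℓ} at most. Under (3.2), the mixture’s predictions are thus dominated, at least on short data sequences, by the simplest semimeasures compatible with the data.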

Kolmogorov complexity and simplicity of semimeasures In the previous Section, I already expressed some doubts about the definition of Kolmogorov complexity as giving an objective specification of simplicity. We take these doubts with us when we use Kolmogorov complexity to measure the simplicity of semimeasures rather than descriptions, finite bit strings. In fact, things might seem a bit more problematic still. Strictly speaking, it makes no sense to speak of the Kolmogorov complexity of incomputable objects, for the simple reason that there are no effective descriptions of any such object – let alone a shortest one. Yet here we make reference to the Kolmogorov complexity of semicomputable semimeasures in M. The solution to this riddle is that Li and Vitányi define the Kolmogorov complexity K(µ_i) as the complexity K(i) of integer i. The Kolmogorov complexity of an integer i, in turn, is defined as the Kolmogorov complexity of the i-th finite string in some standard enumeration (e.g., 0, 1, 00, 01, 10, 11, ...). This might seem blatantly arbitrary, but the crux is that {µ_i}_i is an effective enumeration. The exact order is not essential because every other effective enumeration can be recovered by a single computable transformation, which translates into a single additive constant in the definition of K. Still, even if the description of number n is sufficient information to effectively approximate the value for any input of the n-th semimeasure in a given effective enumeration {µ_i}_i, it remains a bit of a stretch to then identify the descriptive complexity of the semimeasure µ as a complete object with the descriptive complexity of n – a concern that is added to the already existing concerns with consequently identifying shortest description length with simplicity.
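The invariance under re-enumeration appealed to above can be spelled out as follows (my reconstruction of the standard argument, not a quotation): if {µ_i}_i and {ν_j}_j are two effective enumerations of M, then, as asserted above, there is a computable translation g with ν_{g(i)} = µ_i for all i. Since

K(g(i)) ≤+ K(i) for any fixed computable g,

the complexity assigned to one and the same semimeasure changes by at most an additive constant, depending on g but not on the semimeasure, when we switch enumerations.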

3.3.2.2 The Relation to Other Weight Functions

Let’s pretend that we are satisfied with Kolmogorov complexity as our measure of simplicity of semimeasures. Then the fact remains that the mere existence of one weight function employing simplicity as in (3.2) is not too interesting, unless it could be taken to show that a bias towards simplicity is in fact some more pervasive component of the theory. This might be the case if every universal predictor inherits such a bias, conceivably because every weight function is in some sense close to the above one. Alternatively, if a universal semimeasure with such a bias is in some respect optimal.

Optimal weight functions This strategy, too, is taken up by Hutter [79, 81, s. 5.3, s. 3.6.4]. He records as a theorem that the weight function w(µ_i) := 2^{-K(µ_i)} is optimal in the sense that it leads to the smallest loss bounds within an additive constant. Namely, for weight functions v with small Kolmogorov complexity, from the fact that [119, Corollary 4.3.1]

K(σ) ≤+ − log v(σ) + K(v) (3.3)

for discrete semicomputable semimeasures v (that fits the definition of weight function), by identifying σ with (the integer representation for) semimeasure µ_i we get (by sharing small K(v) under the additive constant)

− ln w(µ) ≤+ − ln v(µ)

for all semimeasures. Thus, the bound derived in the convergence Theorem 1.9 for the universal predictor defined with w is not much larger than that of any other universal predictor – weight function w is optimal, Hutter concludes. Note, again, the reliance on feasible constant bounds. The first inequality (3.3) above already involves an additive constant; the term K(v) brings in another. Hutter tries to keep the latter constant in line by stipulating that v have low Kolmogorov complexity, but this limitation has no motivation that wouldn’t make the argument circular. Worse, the argument goes through for many other weight functions in place of w(µ_i) = 2^{-K(µ_i)}, even ones that differ by more than a constant factor [81, exc. 3.7]. All in all, this argument falls short of bearing the weight of the conclusion.5 Harking back to the conclusion of the previous Section that it’s most fruitful to interpret the class of universal predictors as a single predictor, looking for optimal predictors within this class is misguided from the start. Additional properties of the universal predictors, including a bias towards simplicity, are interesting only if they hold for the complete class. Failed attempts at identifying a subclass of optimal universal predictors only support the validity of this view.

All weight functions The remaining strategy is to make plausible that every universal mixture distribution in fact shares a bias towards simplicity. This comes down to an argument that every weight function implements such a bias. A strategy of this kind has to come to terms with definitions of weight functions that appear to deny any part for simplicity. For instance, v(µ_i) := 2^{-i} is obviously a valid weight function. It results in a mixture distribution

ξ_v(σ) = Σ_i 2^{-i} µ_i(σ) (3.4)

that is also universal, but that flagrantly lacks any intuitive link to simplicity of semimeasures. The reply would have to be that such weight functions have a (well-hidden) preference for simplicity nonetheless. Namely, they are all very close to the distribution ξ_w of (3.2) that does implement this preference. If successful, this strategy might be taken to uncover the simplicity bias as an unexpected consequence of a definition that didn’t explicitly include it.
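A side calculation of my own makes the contrast with (3.2) vivid: since K(i) ≤+ log i + 2 log log i, the simplicity-based weights satisfy

2^{-K(µ_i)} ≥ α / (i · (log i)^2) for some constant α > 0,

so they decay at most polynomially in the index i, whereas v(µ_i) = 2^{-i} decays exponentially. A semimeasure of low Kolmogorov complexity that happens to occur late in the chosen enumeration is thus weighted incomparably lower by v than by the weight function of (3.2).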

5Hutter makes the additional reservation that it’s also “more of a bootstrap argument, since we implicitly used Occam’s razor to justify the restriction to enumerable [i.e., semicomputable] semimeasures” [81, p. 103]. However, there is no trace of such an argument in the section he refers to. He actually motivates the restriction to the particular class M in [81, s. 2.4.3] by the fact that it is the largest class of (semi)measures that still contains universal elements that are effective in the weakest sense of semicomputability.


Apparently the only candidate to explicate this proximity is the property of formal universality, which implies that all universal predictors dominate each other. In particular, one can straightforwardly derive the bound ξ_v(σ) ≤ 2^{K(ξ_v)} ξ_w(σ). The universal mixture resulting from any weight function v never deviates by more than a constant factor 2^{K(ξ_v)} from the simplicity-biased mixture ξ_w. Once again, the argument fully rests on our accepting a constant multiplicative bound as “close” – notwithstanding the obvious examples that test our intuition to the limit. It’s fair to say that neither strategy – demonstrating simplicity in all universal mixtures by showing their proximity to ξ_w, nor confuting this by counterexamples – brings the matter to a convincing conclusion. Perhaps the most expedient move is to bring the definition of algorithmic probability back in the picture, and point at the fact (Theorem 1.7, page 36) that every universal mixture ξ_v is equal to the algorithmic probability distribution Q_U via some U. Unlike the majority of weight functions, this definition in terms of description lengths has – with all the problems when we look at it in more detail – at least some immediate intuitive connection to simplicity.

3.3.3 Conclusion

Establishing a simplicity bias in the definition of the universal mixture distribution rather than algorithmic probability appears to run into many additional difficulties. Next to having to set off from a more dubious definition of Kolmogorov complexity of semimeasures, we need to expand this supposed simplicity bias in just one specific subclass of mixture distributions to a claim about the complete class Uξ. In the end, our best chance of demonstrating a simplicity bias in the definition of the universal mixture seems to be to turn again (via the coincidence of classes UQ and Uξ) to the definition of algorithmic probability. That is, we reduce the problem of showing a simplicity bias in universal predictors in UM to the problem of demonstrating this simplicity bias in the definition of Q – in which case we can pick up the thread where we left it in the Conclusion 3.3.1.3 of the first Subsection. The problematic part of arguing for a simplicity bias in algorithmic probability is the identification of simplicity with this specific formalization in terms of description lengths. We saw that, even if we accept the link between simplicity and description length at an intuitive level, such a stipulation of a definition of simplicity is threatened by an inherent non-uniqueness, both within this definition and between this definition and other superficially similar definitions. The objectivity of any of the discussed potential definitions of simplicity suffers from the fact that they only pin things down within arbitrarily large constant bounds. And the objectivity of any one specific definition further suffers from just the fact that there are multiple competing definitions, that can mutually differ by even more than constant bounds (the prime example being Km and KM = − log Q). It raises the question how much formalizations are allowed to differ if we still want to call them “unique” or “equivalent” – which is, indeed, the recurring motif in all of the discussions of the current Section. If the debate is about whether the definition ξ should be considered unique, or whether there is at least an approximate equality Km ≈ KM, then we have to somehow decide when to still count something as “the same” or “close enough”. Should we

decide that asymptotic equivalence is equivalence, and are we only allowed to say that Km and KM are no longer approximately equal once the precise numerical difference surpasses the bound (3.1)? In statistical practice, a log log bound is generally considered negligible. But should that count in this fundamental setting? In statistical practice, constant bounds are also no problem because constants are naturally small. Again, where to draw the line? Decisive intuition is lacking. In my view, this lack of solid ground should warn against strong positive conclusions about simplicity in Solomonoff’s theory. The opposing standpoint would be that I make too much of these numerical variances, and that it’s all just a matter of accepting the description lengths of algorithmic probability as our definition of simplicity. Perhaps this issue can never be resolved to everyone’s satisfaction, and I’m ready to admit that none of the considerations of this Section are truly fatal to the project of locating a simplicity bias in SP. However, at the end of the day, the burden of proof is on those who want to establish the positive result that the universal predictor justifies, or at least implements, a bias towards simplicity – not on those who aren’t convinced yet. Virtually all authors just take Simplicity for granted, assuming that it’s obvious from the definition. But as I hope to have shown here, there is plenty of room for doubt.

The subsidiary role to Universality In my proposed interpretation of Solomonoff Prediction, the core principle is Universality. The obligation of the universal predictors to the principle of Universality is clear and unproblematic – which, as matters stand, can’t be said of Simplicity. More fundamentally, the idea of Universality conceptually unifies Solomonoff’s theory. Even if the doubts I have worded in this Section could be taken away, even if I were forced to admit that all universal predictors in UM exhibit an obvious simplicity bias after all, then still Simplicity is a conceptual consequence (intriguing as it would be) of the overarching principle of Universality. Even if SP did in fact provide a justification of Occam’s razor, I would still say that Simplicity is subsidiary to Universality. The current Chapter concerned the role of Universality from the perspective of the universal predictors. The next Chapter 4 is devoted to the universality of the setting, and complements the argument for Universality as the unifying principle of Solomonoff Prediction.

4 Universality of the Model

In the previous Chapter, the focus was on the Universality of the universal predictor. It has been helpful, in the analysis of the core principles associated with the universal prediction method, to be able to treat the framework of SP as a sort of isolated whole; for that reason I simply went along with the theory’s commitment to the information-theoretic, probabilistic and effective setting of bit sequence prediction. In this Chapter, we tread outside the model and look at its connection to the wider world: we finally arrive at a discussion of the Universality of the model itself. I argue that the model is universal in the sense of being maximally general and unrestrictive. The goal is to make plausible that the model could, in principle, accommodate any conceivable predictive problem. A demonstration of the universality of the model completes the argument that Universality is the core principle of Solomonoff’s theory of Prediction. Section 4.1 treats of the unrestrictiveness of the information-theoretic language of binary strings, Section 4.2 of the unrestrictiveness of the probabilistic environments, Section 4.3 of the unrestrictiveness of the effective environments. The discussion of the last aspect is a bit inconclusive; but things start to look much better if we interpret the class M as consisting of predictors rather than environments, as proposed in Section 4.4.

4.1 Encodings

Solomonoff Prediction addresses the continuation of a string of bits that is known to be governed by some effective probability measure. The current problem is how we can fit prediction in general into this straitjacket. Any predictive problem can be put in the following form. Given: a series or collection of past data. Required: (a probability distribution on) the expected series or collection of data in (a part of) the future. What concerns us: can every such problem in the wider physical world be represented as a formal past (a finite binary sequence) plus an unknown formal environment (a computable measure) that governs the possible formal futures (continuations of the formal past)? As a start, can the relevant data itself always be represented as (parts of) binary sequences?
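Schematically, and in the notation of Chapter 1 (this is only a compressed restatement of the questions above, not an additional assumption): an encoded predictive problem consists of a finite binary past x ∈ 2^{<ω} together with an unknown environment, a computable measure µ, governing the continuations of x; the sought-for answer is (an approximation of) the probability, conditional on x and determined by µ, of the possible continuations, and Solomonoff’s proposal is to answer with the corresponding conditional probabilities of the universal predictor M instead. The first of the two questions just raised, whether the data itself can always be cast as a binary sequence, is taken up next.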

4.1.1 Data and Binary Sequences

The circumstance that all data from the physical world relevant to a predictive problem is to be represented as binary strings has some affinity with the metaphysical position

that the fundamental substratum of the physical world is digital information.

Digital physics Digital physics (later also digital philosophy) is the name that E. Fredkin gave to the direction of research that follows from the assumption that the physical universe is ultimately made up of (digital) information [51, 52]. His own early ideas influenced a number of other prominent physicists, including J.A. Wheeler, who coined the phrase “it from bit”. In his words: “Otherwise put, every ‘it’ – every particle, every field of force, even the space-time continuum itself – derives its function, its meaning, its very existence entirely – even if in some contexts indirectly – from the apparatus-elicited answers to yes-or-no questions, binary choices, bits.” [184] Thus the ultimate constituents of the universe are brute differences between states. Many authors embed this view in an ontic pancomputationalism, a belief that the universe itself is a computing system. This belief was first put forward by computer pioneer K. Zuse in his Rechnender Raum ([190], 1969); subsequent authors in this tradition regularly employ a giant cellular automaton as a model of the universe (S. Wolfram’s controversial A New Kind of Science ([187]) being a famous example). A main implication of digital physics is that at the fundamental level, the world is not continuous but discrete. This is at odds with mainstream physics [D. Dieks, personal communication]. While the standard adherence to a space-time continuum is by itself no conclusive ground to reject the discrete view of nature (and the picture of quantum mechanics does give rise to broad speculation in the direction of a fundamental discreteness), there is a conspicuous lack of positive evidence that speaks for the digital position. It remains unclear, moreover, how the abstract state differences that are assumed to underlie everything (a view that is really a kind of Pythagoreanism, much like the ancient view that “everything is number”) are supposed to give rise to the physical structure of the world [129].

The project of digital physics would fit well with Solomonoff’s theory, because it tells us that the bit sequences that SP relies on truly make up physical reality. The bit sequences that are the core of the model of SP then at once form an as-complete-as-possible representation. We could, nevertheless, hold fast to the conviction that the full information about the relevant events in the world can be exhaustively captured in a binary string (“reality can be fully represented by binary information”), while staying clear of ontological claims of any kind (“reality is binary information”). We would have to present grounds why this completeness of bit representation is reasonable, though, and in the process make much more precise what suppositions are needed to make such a picture work in the first place. One supposition that we would certainly still be bound to, and a very strong one at that, is a fundamental discreteness of nature.

Epistemic flexibility There is no reason, however, to enforce such a demanding requirement of completeness of bit representations. We are concerned with problems of prediction, and prediction has a distinctive epistemic ring. We observe a number of events in the world, or make a number of measurements of physical quantities, and we predict what we think we’ll observe or measure next. I have purposefully put the characterization of predictive problems at the start of this Section in terms of data. This data is the result of direct observation or measurement of real-world events or quantities. It’s this data that in turn has to be represented as a binary sequence.


[Figure 4.1 schema: event → discrete data (via observation) → binary code (via encoding)]

Figure 4.1: From real-world events to observation data to a binary sequence.

The conceptual distinction between real-world events, observational data, and binary sequences is depicted in Figure 4.1. Of course, any data we retrieve from the world must be finite, must have a maximum level of accuracy – which obviously implies discreteness at that level of precision. Every predictive problem could be naturally associated with some maximal level of accuracy of data. The final step from these discrete data items to a concatenated representation in a single binary sequence involves the choice of an encoding, a mapping from data chunks to binary codes, which introduces more freedom still. It’s only sensible to allow for encodings at any desired level of fine-grainedness, at any level of abstraction. (Of course, for some predictive problems the data is already in binary form, and so data and representation simply coincide.) The supposition that it’s always possible, in principle, to construct a single binary sequence via a given encoding looks perfectly reasonable. This gives a flexibility that fits the aim of a maximally unrestrictive model.
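To make the passage from measurements to a single binary sequence concrete, here is a minimal sketch (in Python; the precision and the code width are arbitrary choices made for the example, nothing in the theory fixes them) of the observation-and-encoding pipeline of Figure 4.1:

```python
# Minimal sketch: real-valued measurements are discretised at a chosen accuracy,
# and the resulting data items are concatenated, via a fixed-width code, into one
# binary sequence (the "formal past").

def encode(measurements, precision=0.25, bits_per_item=8):
    """Quantise each measurement to the given precision and emit a fixed-width code."""
    chunks = []
    for x in measurements:
        level = int(round(x / precision))                                  # discrete data item
        code = format(level % (2 ** bits_per_item), f"0{bits_per_item}b")  # binary code
        chunks.append(code)
    return "".join(chunks)

print(encode([0.50, 0.75, 1.25]))   # '000000100000001100000101'
```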

4.1.2 The Language of Binary Sequences

In getting rid of the demand or desire for a unique complete representation, and allowing for encodings at arbitrary levels of abstraction, we do find ourselves with a lot of freedom in choosing particular encodings. In fact, what constraints ought to be imposed – if any? Lacking constraints, we could adopt any encoding we like. This apparent absolute freedom in ways of coding is commonly held against algorithmic information theory, as indicating yet again that the endeavour must be highly subjective. If we could really pick any binary string of our liking to represent a particular observation or measurement value, how can claims about the algorithmic probability of these codes have any objective bearing on the data that we have, let alone the real-world phenomena the data are extracted from? The issue could alternatively be phrased as the arbitrariness of picking a specific formal language to represent things. This brings to the fore the similarity of the current problem to the bane of Carnap’s system – the dependence on a rather circumstantial formal language. One difference is that in our case, the choice of a language is much more unrestrictive. Another is that this choice doesn’t really matter.

Information-theoretic grue To illustrate the problem, consider a specific example by Kelly [95]. He mentions an encoding that is inspired by Goodman’s new riddle of induction [64]: a “grue-like” encoding that reverses the assignment of bits to certain observations at specific times (e.g., assigning 0 to “green” and 1 to “blue” before time t, and the other way around from t on). The implication is that such arbitrary encodings must have an effect on the theory.


In replying to this example, let me be granted two things. First, there exists some “baseline” encoding such that there is an actual measure µ that assigns the “correct” probabilities to the data via this encoding. Second, the “arbitrary” reversals in the grue-like encoding follow an effective pattern (as in the onetime reversal at t). Granted these things, my reply is that there must also be a measure that gives the same predictions based on this grue-like encoding as the assumed actual measure that presupposes the baseline encoding without reversals. Consequently, even on adopting such an outlandish encoding, the actual measure is taken into account by the universal predictor. There will certainly be variations within constant multiplicative bounds of predicted values, but to worry about this would be to fall in the same trap as wondering which particular universal predictor to use: the relevant thing is the existence of a class of universal predictors, and each member considers all such encodings.
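As a toy illustration of this reply (a sketch only, not the thesis’s formal machinery; the baseline measure and the reversal time below are invented), note that composing a computable measure with the effective grue-like recoding simply yields another computable measure on the recoded strings, so a universal mixture over all computable measures already accounts for it:

```python
# Toy illustration: mu is a computable measure on baseline codes; the grue-like
# recoding flips the meaning of the bits from position T onward; mu composed with
# the (self-inverse) recoding is again a perfectly computable measure.

T = 3  # hypothetical time of the one-time reversal

def grue(s):
    """The effective recoding: identity before position T, bit-flip from T on."""
    return s[:T] + "".join("1" if b == "0" else "0" for b in s[T:])

def mu(s):
    """A toy computable 'baseline' measure: independent bits with P(0) = 0.7."""
    p = 1.0
    for b in s:
        p *= 0.7 if b == "0" else 0.3
    return p

def mu_grue(s):
    """The measure governing grue-encoded strings: mu of the decoded string."""
    return mu(grue(s))

s = "0010110"
print(mu(s), mu_grue(grue(s)))   # identical values: the recoding changes nothing essential
```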

Encodings and semimeasures Thus the threat of arbitrariness by encodings is taken away in much the same way as the threat of subjectivity by having to select a particular universal predictor – indeed, the two threats can be seen as two manifestations of the same issue of subjectivity. They are taken away by keeping to the level of the class of universal predictors. We may conclude that Solomonoff’s language of bit strings escapes the most important objection against Carnap’s formal language. Admittedly, the above argument for the irrelevance of the particular encoding only works under two conditions. In the first place, it only applies to encodings that are effective variations of the baseline encoding. The theory won’t be able to cope with an encoding that employs infinitely many reversals at completely random times. But it seems only fair to impose some minimal requirement of consistency of encoding, and effectiveness is the natural constraint in the current setting. In addition, talk about variations presupposes the existence of a baseline encoding, an encoding that fits the actual measure at work. This supposition derives from the theory’s starting assumption that there is an actual computable measure µ that assigns the correct probabilities to the data. Actually, a measure can only assign probabilities to binary strings (codes), so the assumption of an actual measure already includes a preferred “actual” encoding that acts as an intermediate between the probabilities and the data. See Figure 4.2, depicting that a measure indirectly assigns probabilities to the data via the direct assignment of probabilities to binary strings. Another way of looking at things, that is consistent with this general picture, is that there is a class of different actual measures, each with an associated encoding that makes sure the assigned probability values wind up at the right data. In any case, the encodings are strongly related to the measures that are supposed to govern the data. The measures, as models of predictive environments, are the topic of the next Section.

4.2 Environments

The suggestive term environment serves to indicate that the probability measure gives a complete specification of the (probabilistic) constraints on the development of the (indeterministic) history. What concerns us in this Section is how general this choice of probability measures is: could every real-world environment (that is, a specification of the behavior of a developing sequence of data) be modeled as a probability measure?

[Figure 4.2 schema: a probability measure assigns probability values to the binary codes and, via the encoding, to the underlying discrete data: event → discrete data (via observation) → binary code (via encoding)]

Figure 4.2: The assignment of probability values to data via codes.

4.2.1 Generality of the Probabilistic Environments

It’s hard to see how to come up with a supposition less restrictive than that of some probabilistic source at work – at least, if we are prepared to make the minimal assumption that the generation of a data sequence is governed or described by something. Effectiveness aside, there is no restriction whatsoever imposed on the probability measures, about independence or otherwise. Needless to say, this class includes deterministic sequence generation, that arises from a distribution that only assigns probabilities 1. Finally, recall that the measures have as sample space the class of finite strings, rather than {0, 1}, which means they provide for “time-dependent” probability values. It would seem that if there is some way of mathematically describing the generation of a bit sequence, then it must also be described by one such probability measure.
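Both points admit simple toy illustrations; the sketch below (with invented numbers) writes a deterministic environment and a “time-dependent” environment directly as functions on finite binary strings:

```python
# Two toy environments, given as functions on finite binary strings.

def deterministic(s):
    """Deterministic generation: probability 1 on prefixes of 010101..., 0 elsewhere."""
    target = ("01" * (len(s) // 2 + 1))[:len(s)]
    return 1.0 if s == target else 0.0

def time_dependent(s):
    """'Time-dependent' probabilities: the chance of a 1 grows with the position."""
    p = 1.0
    for t, b in enumerate(s, start=1):
        p1 = t / (t + 1)                 # probability of a 1 at position t
        p *= p1 if b == "1" else 1 - p1
    return p

print(deterministic("0101"), time_dependent("011"))   # 1.0 and 0.25 (up to rounding)
```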

4.2.1.1 Measures or Semimeasures

The class M that is the basis of Solomonoff’s theory actually involves semimeasures rather than measures. As noted at the end of Section 1.3 of Chapter 1, however, the choice to let the family of all possible actual distributions extend to the class of semicomputable semimeasures would in the first place be motivated by the mathematical consideration that to have the class contain its own universal elements is the most elegant choice. This doesn’t appear to be sufficient ground to bar us from simply restricting the possible actual distributions to the sufficiently general class Mmeas of proper measures. The restriction of the proof of the convergence of universal predictors (Theorem 1.9, page 39) to actual measures gives us one reason to do so. Another reason is, of course, the rather odd notion of a “defective distribution”, that yields probabilities that don’t quite sum to 1, or that assigns a probability to a disjunction of events that surpasses the sum of the probabilities it assigns to the separate events. The only consequence of keeping to proper measures is that the universal predictor is not in that class, and might be said to be a bit too powerful: it can be seen to take into account distributions (namely, the strict semimeasures) that we excluded beforehand ourselves. Furthermore, one might feel uneasy about the fact that this predictor is a strict semimeasure itself. While we can plausibly refuse to admit strict semimeasures as actual distributions, we are inevitably left with universal predictors that are “defective”. Should we worry about this?

Normalizations Solomonoff himself wasn’t happy with the idea of a defective algorithmic probability distribution. He insisted on the “real” probabilities involved in the experiment with the universal monotone machine on random input (of the definition of algorithmic probability). What we are interested in, he maintains, is the actual relative probability that the universal machine in this setting produces a bit 0 rather than 1 after we have seen it produce a string σ. This comes down to the ratio Q(σ0)/Q(σ1). As I discussed in Chapter 1, Section 1.3.2, the machine may produce no output1 after σ: this causes Q(σ) > Q(σ0) + Q(σ1), and to keep to Solomonoff’s goal, the additional probability involved should be distributed over the possibilities of 0 and 1 in such a way that the ratio remains the same. To that end Solomonoff devised a unique normalization that turns the semimeasure into a proper measure [165]:

\[
Q_{\mathrm{norm}}(\epsilon) := 1, \qquad
Q_{\mathrm{norm}}(\sigma b) := \frac{Q(\sigma b)}{Q(\sigma 0) + Q(\sigma 1)} \, Q_{\mathrm{norm}}(\sigma) \quad \text{for } b \in \{0, 1\}.
\]

This normalized algorithmic probability indeed satisfies Qnorm(σ0)/Qnorm(σ1) = Q(σ0)/Q(σ1) for all σ. However, the probabilities given by Qnorm are not even computably approximable anymore: Qnorm is not semicomputable. [119, s. 4.5.3]
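A small sketch may help to see what the normalization does; the toy semimeasure Q below is made up, and is only required to satisfy Q(σ) ≥ Q(σ0) + Q(σ1):

```python
# Toy semimeasure on short strings (values invented, with some probability "missing").
Q = {"": 1.0, "0": 0.4, "1": 0.3, "00": 0.2, "01": 0.1, "10": 0.1, "11": 0.15}

def Q_norm(s, memo={"": 1.0}):
    """Ratio-preserving normalisation: Qnorm(sb) = Qnorm(s) * Q(sb) / (Q(s0) + Q(s1))."""
    if s not in memo:
        parent = s[:-1]
        memo[s] = Q_norm(parent) * Q[s] / (Q[parent + "0"] + Q[parent + "1"])
    return memo[s]

# Ratios between sibling extensions are preserved ...
print(Q["10"] / Q["11"], Q_norm("10") / Q_norm("11"))          # both ~0.667
# ... and the defect is gone: Qnorm(s0) + Qnorm(s1) equals Qnorm(s).
print(Q_norm("0") + Q_norm("1"), Q_norm("00") + Q_norm("01"))  # ~1.0 and ~Qnorm("0")
```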

Accepting semimeasures Even if we were to go along with Solomonoff in attributing central importance to the experiment of the universal machine, it’s not clear at all why this would make us desire a ratio-preserving normalization that gets rid of the obsolete “non-halting” probability. Nor does the argument that the ratios of a proper measure really “need to be used in applications” [119, p. 304] appear to have much basis – certainly not if we don’t go along with the method interpretation. That leaves us stuck at the starting point: that we simply have to preserve the “sacrosanct” notion of measure. In the absence of independent reasons to do so, I don’t see why we should. The universal predictor’s regularly assigning some probability to “no future” sounds undesirable, if put in this slightly provoking way, but at least there are ways of making sense of a prediction method that (mal)functions in this regard (something which would be more doubtful in the case of environments). It doesn’t diminish in any way the predictor’s single important property of formal universality, and so of its convergence properties. Indeed, it’s formally necessary for universality – at least, if we want to uphold the requirement of semicomputability. For the only way to proceed otherwise is to introduce the normalized measure Qnorm that is no longer semicomputable. And this loss of semicomputability seems to me a far greater drawback than the use of a semimeasure. Effectiveness is a natural property of a predictor, and so the minimal level of effectiveness guaranteed by semicomputability, or effective approximability, is a desirable property of the universal predictor (also see Subsection ??). The theory’s result that in a universal setting, there are universal predictors is only stronger the more powerful the properties of the universal predictors, favouring a low threshold of effectiveness.

1Solomonoff devoted one of his last papers [171] to the derivation of an upper bound on the probability of this occurring.

4.2.2 Interpretation of Probabilities

The probabilities of Solomonoff’s framework are logical in the sense that they derive from purely formal stipulations. Similar to Carnap’s approach, their values are simply defined by the choice of a probability measure; similar to Carnap’s approach, the degree of confirmation between evidence and hypothesis (in SP, between a past and a future), mediated by Bayes’ rule, follows from this choice. And similar to Carnap’s approach, this leaves open the question what the probabilities should mean.

4.2.2.1 The Actual Measure

In regard to the question of the interpretation of probabilities, it makes sense to distinguish the two distributions that play a part in the setting of SP. There is the actual measure, and there is the universal predictor, a semimeasure. The actual measure, as part of the universal model that is to accommodate any predictive problem, is to represent any actual environment. The chances involved in the probabilistic data generation are then given by this measure. What kind of probabilities are involved must fully depend on the process to be modeled, on the specific predictive problem. The universal model credibly suits any predictive problem, at any level of abstraction, and with any associated interpretation of probability values. Each predictive problem is bound to bring with it a natural interpretation of the probabilities involved: the flexible model of SP itself leaves this open.

4.2.2.2 The Universal Predictor

For every specific problem, the universal predictor provides a close approximation to the unknown actual measure. It simply approximates the “true” probability values, no matter how they are interpreted. Again, the matter of the meaning of the probabilities is deferred to the specific predictive problem accommodated. However, there is another way of looking at the universal predictor, that is closer in spirit to Carnap’s endeavour. While this presents a departure from the interpretation of SP that I pursue in this thesis, it connects to the main theme of universality.

Objective-logical restrictions I presented Solomonoff’s theory in Chapter 1, Subsection 1.1.2 as a successor of Carnap’s project of establishing objective probability values from logical constraints. The apparent tension between the objective-logical approach and the subjective interpretation of probability that Carnap inclined to is resolved by taking the former as an ideally rational limit case of the latter. In Carnap’s own words,


“I think there need not be a controversy between the objectivist point of view and the subjectivist or personalist point of view. (...) Basically, there is merely a difference in attitude or emphasis between the subjectivist tendency to emphasize the existing freedom of choice, and the objectivist tendency to stress the existence of limitations.” [90, p. 119]

As noted by Zabell [189], the crucial point is the status of these limitations. Carnap believed that it should be possible to design a formal system that can impose restrictions that have objective and, indeed, universal, status, and that would finally result in unique probability values. The position of the opposing subjectivist camp is that such restrictions can never be more than auxiliary stipulations. These might very well be employed in certain circumstances but won’t hold in others. Universally valid and logically compulsory constraints, of the kind envisaged by Carnap, do not and cannot exist.

The universal prior distribution Recall from Chapter 1, Subsubsection 1.1.2.2 the Bayesian problem of agreeing on the prior probabilities. I originally introduced universal M not as a prediction method, but as a universal prior distribution. In line with Carnap’s directions, this prior distribution then also arises from purely formal constraints. It can be seen to result from an “objectivist tendency to stress the existence of limitations”. These limitations are the limitations that are under discussion in this Chapter – the constraints of a language of bit strings, and of probabilistic and effective environments. At the same time, the purpose of this Chapter is to show how unrestrictive these constraints are. Sections 4.3 and 4.4 will impress upon us more clearly still how the constraint of effectiveness is sufficiently strong to allow us to come to nontrivial results, while still being sufficiently mild to hardly put us in danger of excluding anything of interest. In the case at hand, the restrictions are sufficiently strong to allow us to pin down a prior distribution, while sufficiently mild to still pertain to full generality. To skeptics of the possibility of universally objective constraints, the main proof of the failure of Carnap consists in the arbitrariness of the artificial language that he had to employ. I argued in Subsection 4.1.2 that Solomonoff’s language of binary strings evades this objection. However, notwithstanding the pronounced generality of the language of bit strings, the final dismissal of the problem of subjectivity relied on my interpretation of the theory of SP as asserting the existence of a class of universal predictors in a universal setting. Elevating universal M to the status of a unique universal prior distribution, we would again have to face the issue of subjectivity (Chapter 2, Section 3.2) in earnest: there are in fact infinitely many asymptotically equivalent universal prior distributions, none preferable to the others. The moral seems to be that the objective-logical approach based on Solomonoff goes a long way towards the definition of a universally valid prior distribution, but not all the way: much to the liking of the subjectivist, there remains a strip of freedom that continues to resist invasion.


4.3 Effectiveness

The framework of measures on encodings of bit strings leaves little to wish for in terms of generality. Of course, that’s not the whole story: the perfectly general measures are still constrained by the requirement of effectiveness.

4.3.1 A Natural Restriction?

The choice for an effective setting is obviously a restriction to the model’s generality. Unlike the bare probabilistic setting, it’s clear that the effective setting leaves things out: distributions that are not computable. However, in line with the way I presented the choice for this restriction in Chapter 1, Section 1.1.3, we must invoke some restriction to get a grip, to distill anything nontrivial. It simply seems impossible to handle full-blown generality – where to begin?

Algorithmic randomness: effectiveness to the rescue A nice analogy to the justification of a restriction to effectiveness is found in the inception of the field of algorithmic randomness (that we encountered before on page 74). The field originated in the goal to provide a definition of randomness of infinite binary sequences. There are several intuitions connected to the notion of randomness, that can all be associated informally with binary sequences. A random sequence should lack regularity, making it hard to compress. It should be unpredictable, making it hard to make money by placing bets on its continuation. But it should also be typical in the sense that it shows no statistical anomalies. The problem is how to make these intuitions precise. Indeed, how to capture something that is characterized by evading characterization in a precise characterization? A sequence can’t have no structure at all – not if randomness is itself a structural property. And a sequence can’t show no statistical atypicalities at all – not if the intersection of all classes of typical sequences turns out to be empty. Full generality, in this case, is simply contradictory. The solution is again the restriction to effectiveness. A random sequence is stipulated to be a sequence that is effectively incompressible (all initial segments have maximal Kolmogorov complexity). There are no effective betting strategies that allow you to make money on its continuation (there is no semicomputable martingale that yields unbounded capital along the sequence). And it passes each effective test for atypicality (it passes each Martin-Löf test [121]). Further credibility to this move is given by the result that those three requirements isolate the exact same class of random sequences – though we have seen that there is still some amount of freedom in choosing the proper level of effectiveness (page 74).
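For orientation, the incompressibility and typicality requirements can be stated compactly in one standard form (with K denoting prefix-free Kolmogorov complexity, one common way of making “maximal complexity” precise; the betting requirement is phrased analogously in terms of effectively approximable supermartingales):

\[
\exists c \, \forall n : \; K(X_{1:n}) \ge n - c
\qquad \Longleftrightarrow \qquad
X \notin \bigcap_{n} U_n \ \text{ for every Martin-Löf test } (U_n)_{n \ge 1}.
\]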

If some restriction is unavoidable, the restriction to effectiveness looks very mild indeed. All that’s imposed, on the face of it, is that things are calculable in principle – supposing that this is accurately captured by Turing-computability. As a matter of fact, the full generality of the effective setting is oftentimes simply defended (if at all) by a direct appeal to the Church-Turing Thesis:

“It should be appreciated that according to the Church-Turing thesis, the class of all computable measures includes essentially any conceivable natural environment.” [140, p. 1118]


Unfortunately, things are a bit more complicated.

4.3.2 Turing Computability in the Wider World

Let’s focus for the moment on the general question whether real-world processes are adequately modeled by Turing-computable functions. The claim that they are can actually be broken down into two separate claims:

1. Real-world processes can be considered computable in the intuitive sense (calculable).

2. Informal computability (calculability) is adequately characterized by Turing-computability in the formal sense.

The Church-Turing Thesis (CTT) underwrites the second claim – provided that there is agreement on the original sense of “calculable”. This is not the case, resulting in a distinction between what we may dub the Original CTT and the Physical CTT. The Bold version of the latter also extends to include the first claim.

[Figure 4.3 schema: nested regions, from inner to outer: computable by Turing machine ⊆ computable by human/digital computer ⊆ computable by physical system ⊆ doable by physical system; the CTT, the Physical CTT, and the Bold Physical CTT each assert that one further region does not exceed Turing computability]

Figure 4.3: The various versions of the Church-Turing Thesis.

4.3.2.1 The Original Church-Turing Thesis

Recall from Chapter 1, Section 1.1.3 that the Church-Turing Thesis (CTT) states that the informal notion of “effective method” should be identified with the formal notion of “(Turing) computability”. As B.J. Copeland [31, 29] emphatically argues, the notion of effective method originally pertained to the type of procedure that an idealised human computer with only pen and paper and unlimited patience could perform. In the words of Wittgenstein:

“Turings »Maschinen«. Diese Maschinen sind ja die Menschen, welche kalkulieren.” (“Turing’s ‘machines’. These machines are really the humans who calculate.”) [186, §1096]

The basic steps such a dulled human computer is allowed to perform – and the same holds for the basic operations a digital computer is able to perform – can clearly be simulated by a Turing-machine (see also Chapter 1, Section 1.1.3), and so the content of this interpretation of the CTT is rather uncontroversial. But it’s also a bit limited: too limited in any case to function in the way it is regularly employed. Very often (the quotation on the previous page being just one example) the CTT is called on to justify the convenient move of treating any computation at all as Turing-computable. If we wish to preserve the original meaning of “effective method”, however, such a move requires something stronger than the original CTT.

4.3.2.2 The Physical Church-Turing Thesis

When we talk about being computable at all, in our world, we are talking about the extent of physical computability. The associated version of the CTT, that any physically computable function is Turing-computable, can be branded the Physical Church-Turing Thesis (following [131]). An equivalent formulation is given by R.O. Gandy’s [58] Thesis M: whatever can be calculated by a physical machine is Turing-computable. The Physical CTT is stronger than the original, because there may exist operations that can be done by a physically possible machine but that transcend effectiveness in the original sense. If Turing-computable functions are indeed limited to the effectively calculable functions, the existence of such Turing-incomputable operations, while leaving the original CTT untouched, would refute the Physical CTT.

Hypercomputation Computation of Turing-incomputable functions is nowadays called hypercomputation (following [30]). While there are many models of hypercomputation, they are in the first place purely theoretical, notional models, and it remains to be seen if any of them is also physically realizable. Examples of such hypercomputation models are infinitely accelerating machines (each operation is performed twice as fast as the previous [28]) and infinitely shrinking machines (each operation is followed by the machine’s producing a self-copy that aids in the computation [35]); models that are more than likely physically impossible. Sometimes quantum computers are naively hailed as going beyond Turing-computability, but while such computers can be more efficient, they can’t compute more things. A related line of attack is the investigation of the possibility of sources of incomputability in the physical world, that could potentially be harnessed as operations of a hypercomputation machine. Turing’s [173] original model of oracle computation, that allows for consultation of an incomputable real, has recently been used (misused, to some2) to reinforce speculation [30] that the physical universe may contain incomputable information that could feasibly be consulted as an oracle by a hypercomputer. Often mentioned in this context are the specifications of M.B. Pour-El and I. Richards [132, 133] of classical physical systems that are given by equations with incomputable solutions – although it’s still unclear whether such situations can occur in reality. Also noteworthy is H.T. Siegelmann’s presentation in Science ([158], see also [159]) of neural networks that could supposedly compute more than Turing machines. What these and related speculations have in common is their reliance on the exploitation of incomputable real numbers, and even if current physics tells us that many physical quantities can take arbitrary real values and we know for a fact that almost all reals are incomputable, there is no evidence at all that it’s possible to measure such reals with unbounded precision, and so fully utilize them in computation [36, 37].

2Copeland’s presentation of his project of hypercomputation as a legacy of Turing himself, particularly his and D. Proudfoot’s exposition in Scientific American, has attracted heated response from other Turing scholars. The controversy even found its way within the Stanford Encyclopedia of Philosophy when A. Hodges circulated a letter criticizing Copeland’s article as well as his Encyclopedia lemma [29] about the CTT, after which Hodges was invited to write his own lemma about Turing [71], subtly counterbalancing Copeland’s interpretation.

Computations and processes The Physical CTT concerns the extent to which Tur- ing machines capture the intuitive concept of a physical “computation”. But what physical processes do we count as computations to begin with? In a similar vein, what physical systems can count as machines? A sensible approach is to dissect our intuition into a number of requirements like repeatability, resettability and readable input and outputs (cf. the mechanistic account in [129, 130]). Although it has the shortcoming of remaining a bit imprecise, this approach connects well to a practical understanding of computation. For that reason, in spite of Copeland’s belief that we should be “profoundly surprised if the physics of the real world can be properly and fully set out without departing from the set of Turing-machine-computable functions” [31, p. 64], the conclusion of the overview in [129] is that “for the time being, (...) Physical CTT remains plausible. It may well be that for all practical purposes, any function that is physically computable is Turing-computable [emphasis mine].” The other extreme is the pancomputationalist view that everything is in fact a computation, simply because any physical system can be mapped onto a computational description (as promoted by Putnam [136]) or because the universe is itself a computer (a view that we have encountered before on page 84). The proper conception of “computation” remains a complicated issue, and we can try to circumvent it altogether by skipping the intermediate step of “effectiveness” or informal “computability”, and undoing the distinction of two separate subclaims that I started the current discussion with. In other words, we can directly consider the claim that any physical process is Turing-computable. This stronger claim that anything doable at all by a physical system is computable in the formal sense is called the Bold version of the Physical Church-Turing Thesis in [129].

The Bold Physical Church-Turing Thesis Unfortunately, this general statement of the Bold Physical Church-Turing Thesis can also be made precise in a number of very different ways – an echo of the difficulty of making precise the notion of computation, that we just hoped to have left behind. One obvious interpretation, that stays close to the idea of the Turing machine as representation of a function, an extensional input-output relation, requires that every observable value of a physical system is Turing-computable as a function of time [131]. Another, that is probably closer to the idea of “doable”, demands that every physical process can be simulated by a Turing machine, much like the steps of a Turing machine were originally intended to simulate every step of a human computer. This interpretation has a close kinship to the Church-Turing-Deutsch principle [44] that a universal computer can simulate any physical process, and also to what Copeland [27] calls Thesis S, that any process that is mathematically describable or scientifically explicable can be simulated by a Turing machine (regularly instantiated in the philosophy of mind as the supposition that the human brain may be modelled by a Turing machine; e.g. [154]).

Regardless of the exact interpretation of the Bold Physical CTT, our extending the scope of the Physical CTT from informal “computations” to “processes” exposes it to a straightforward refutation. What we lose in this step is a restriction of finiteness (regarding objects that can be read or manipulated) that is only natural for the concept of computation, but not necessarily so for the concept of physical process. This removes the barrier against the hope of most hypercomputation models: the presence or manipulation of real-valued quantities in physical processes. As most reals are incomputable, it would be impossible to maintain that a Turing machine can simulate such processes or calculate the values of these quantities. So we see that this matter in large part reduces to the very much open question of the fundamental discreteness of physical reality. To those who would answer that question in the negative, the Bold Physical CTT is fairly trivially refuted.

4.3.3 Effectiveness in Solomonoff Prediction

Now that we have a somewhat clearer view on the actual range of the various versions of the CTT, I proceed to the issue that really concerns us: what claim properly expresses the validity of the effective model of SP?

4.3.3.1 Randomized Monotone Machines

Finding inspiration in the previous theses, we should like to express a conjecture about the link between the effective operations of our model and effective operations in the real world. The conjecture we look for concerns the relation between real-world environments and the effective environments of SP: the computable probability measures. Actually, for the benefit of discussion, it will be easier to consider the more general case where we don’t restrict the class of environments to the proper measures in Mmeas. Instead, we consider the weaker statement about lower semicomputable semimeasures in M – the important thing is that this statement is still a necessary part of the stronger version involving only computable measures. Specifically, we desire that every real-world environment corresponds to a semicomputable semimeasure. We desire that the probability measure over data sequences induced by any physical process corresponds exactly to some semicomputable semimeasure over these data sequences.


However, the effectiveness of these measures pertains to the potential calculation of probability values, and it’s not entirely obvious how we should like to map these potential calculations to processes in reality. Luckily, we have an exact correspondence between lower semicomputable semimeasures and monotone Turing machines working on uniformly distributed input (Chapter 1, Subsubsection 1.3.2.1). This provides us with a much more concrete picture. Every environment should then be given by a monotone machine that receives uniformly distributed random input. More figuratively yet: a monotone machine that has a (uniform) random number generator at its disposal. The latter move has the added benefit of doing away with the demand of a steady supply of input, and the question where this input would come from. Thus we desire that every history is generated by some randomized monotone machine.
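A toy sketch of this picture may help (the “machine” below is invented for illustration and, unlike a genuine monotone machine, never fails to produce a next bit, so it induces a proper measure rather than a strict semimeasure): feed uniformly random bits to a fixed program and estimate, by sampling, the probability that its output extends a given string.

```python
import random

def toy_machine(coin, n_out):
    """Emit n_out output bits: a 1 whenever two successive random input bits agree."""
    out = []
    while len(out) < n_out:
        a, b = coin(), coin()
        out.append("1" if a == b else "0")
    return "".join(out)

def estimate_semimeasure(sigma, trials=100_000):
    """Monte Carlo estimate of the probability that the machine's output extends sigma."""
    coin = lambda: random.randint(0, 1)
    hits = sum(toy_machine(coin, len(sigma)).startswith(sigma) for _ in range(trials))
    return hits / trials

print(estimate_semimeasure("11"))   # roughly 0.25: each output bit is 1 with probability 1/2
```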

4.3.3.2 A Tailor-Made Church-Turing Thesis

The matter at hand, then, is whether every real-world history of observed data can be seen to be generated by a process in a way that is effective as formalized by randomized monotone Turing machines. This is conveyed by the following statement – let’s call it the Church-Turing-Solomonoff Thesis (CTST).

Thesis 4.1 (Church-Turing-Solomonoff Thesis). The data generation capabilities of any physical system are given by the data generation capabilities of some randomized monotone Turing machine.

Here I employ the expression “data generation capabilities” to account for the possibly indeterministic character of the data generation process: it refers to the probability, for each data sequence, of its being generated. The thesis states that the probabilities on data sequences induced by any physical system correspond to the probabilities induced by a randomized monotone machine. The CTST expresses what we have to assume about the scope of Turing computability with respect to the model of SP, if we hold that the restriction to effectiveness is in fact not a restriction at all. Let us now turn to the question how credible it is.

A first evaluation Note first that our CTST has in common with the Bold Physical CTT that it doesn’t bother with an intermediate informal notion of “computation”: it’s a single bridging claim about the relation of the physical world with SP’s formal apparatus. This would also seem to expose the CTST to the objection regarding real-valued quantities in the physical world. A physical process that generates real-valued outcomes for physical quantities must leave any monotone machine behind. This vulnerability to a continuous physical reality should be familiar by now – as should the way to address it. It’s counteracted by the consideration of data rather than real-world events or quantities. When we consider the data generated (or generable) by a physical system, this includes our imposing some finite maximal level of accuracy: this discrete level of coarse-grainedness also applies to the data (including the additional but inconsequential encoding into bit strings) generable by any monotone machine we might want to correspond to the system. For this reason, our CTST is much less demanding than the Bold Physical CTT. Note that the CTST also only asks for an extensional correspondence between generable outputs, which makes it weaker than the interpretation of the latter that asks for an actual simulation of the physical process by the computation steps of the formal machine (with all the additional unclarities concerning the proper mapping). Another respect in which the CTST is better equipped is the allowance of a random source. This makes the monotone machines of our thesis more powerful than the regular Turing machines that the Bold Physical CTT covers. See, for example, the discussion in [7] of a computer that employs a source of quantum randomness to generate Turing-incomputable sequences. Such a computer might be a counterexample to the Bold Physical CTT; not so for the CTST.

Back again However, the allowance of random sources in physical systems opens up another origin of continuity. Namely, such a source may embody an incomputable distribution, and so induce incomputable probability values on the system’s data se- quences. Our randomized monotone machines, on the other hand, can only manipulate the values from a uniform source, resulting in probability values on data sequences that are at least (semi)computable. This origin of incomputable continuous values is not so easily dismissed. It appears we’re thrown back at our main question in its original, undiluted form: how much of a restriction is the requirement of semicomputable probability values, that is, how much of a restriction is the semicomputability of the semimeasures? How much of a restriction is the computability of the measures? The sobering result of our detour via various versions of the CTT is a further proof of the circumstance that the potential restrictiveness of the effective model of SP boils down to one aspect about which little more can be said: the possibility of physical sources of incomputable probabilities. The plausibility of the normal Physical CTT, disqualifying Turing-incomputable physical computations, transfers to the plausibility of the generality of the monotone machines as physical data-generating computations; the avoidance of the main threat of continuity to the Bold Physical CTT speaks well for the generality of the monotone machines as physical data-generating systems. What remains is the open empirical question of the existence of sources of incomputable probability values in nature.3 The conclusion must be that what we face here is a genuine threat to the universality of the model. In the next Section, I propose a way of answering this threat.

3Note that the generation of a sequence with the help of an incomputable random source, as a threat to the normal Physical CTT, can be dismissed on the grounds that the Physical CTT (unlike the Bold Physical CTT) rests on the informal notion of “computation”, and it’s questionable that this notion includes random generation of a sequence.


4.4 Predictors For Environments

Up to this point, I have interpreted the probability measures in Mmeas as environments that govern the data generation. But as mentioned in Chapter 1, Subsection 1.4.1, equally valid is the interpretation as predictors.

4.4.1 The Model of Predictors

This alternative interpretation realizes a retreat to the purely epistemic domain, the domain of our methods of prediction. Importantly, we do away with the actual “true” environment µ that was supposed to govern the generation of the data. As an added benefit, the new interpretation naturally extends to the complete class M of semicomputable semimeasures.

4.4.1.1 Predictors as Semimeasures

In most of the preceding, the range of possible environments was restricted to the subclass Mmeas ⊂ M of computable measures. One reason for this choice is that the convergence Theorem 1.9 only holds for environments that are proper measures. (Recall from Chapter 1, Subsubsection 1.3.1.3 that the notions of semicomputability and computability coincide for measures.) Another is that the notion of a real-world environment as a semimeasure is hard to make sense of. As hinted at in Subsubsection 4.2.1.1, the notion of a prediction method as a semimeasure is not quite so mystifying. While it’s unclear what to make of a process that is governed by a defective distribution, there is no reason why we couldn’t construct a method of prediction that returns probabilities that don’t exactly sum to 1. Indeed, the universal predictor must be a strict semimeasure. That means that we now do have good reason to let the class contain its own universal elements: they are all predictors. It is a reason to take the complete class M as our class of possible predictors. This is a nice consequence of the new interpretation. Whereas in the previous interpretation the universal predictor took into account elements that in fact weren’t environments at all (namely, the strictly semicomputable semimeasures), the dominance of the universal predictor now concerns precisely every possible predictor.

4.4.1.2 Dropping the Actual Environment

If all that’s left to us are prediction methods, we are forced to remain silent on the possible origins of the data. It’s fine to suppose that there exists a most informed predictor ρ (or a family of such predictors), that, as a semimeasure, is closest to the actual source of the data, and so can be expected to give the most accurate predictions. In a way, this informed predictor takes the place of the actual environment. However, the two shouldn’t be identified: the informed predictor is a semicomputable semimeasure, but we no longer assume that the actual source of the data can also be described in this (effective) way.


Completeness of the universal predictor The formal universality of the universal predictor consists in the fact that it formally dominates every member of M. In the current interpretation that means that it formally dominates every other predictor. Informally, it takes into account every possible predictor. The completeness results of Chapter 1, Subsection 1.4.2 no longer concern convergence to the “truth” of actual µ: they state that the universal predictor quickly converges to the predictions of the (computable) informed predictor ρ. In fact, we can establish a more refined result. We can derive a bound on the difference in predictive success of the universal predictor M and the informed predictor ρ on an unknown environment µ. We don’t need to assume anything about the effectiveness of this measure µ: it might not even be semicomputable. Moreover, we don’t need to assume that ρ is a proper measure: it can be any semicomputable semimeasure. The measure of predictive success or expected error that I use is the summed Kullback-Leibler divergence of the predictor with respect to the environment.4

Theorem 4.2 (Refined convergence in difference). For arbitrary actual measure µ and semicomputable semimeasure ρ, the additional expected total prediction error when predicting with M = ξ_w instead of ρ is

\[
\sum_{t=1}^{s} D_t(\mu \parallel M) \;-\; \sum_{t=1}^{s} D_t(\mu \parallel \rho) \;\le\; -\ln w(\rho).
\]

Proof. Mimicking the steps (a) to (e) in the proof of Theorem 1.9, page 39, we derive that the above left-hand side equals

\[
\sum_{\sigma_{1:s}} \mu(\sigma_{1:s}) \ln \prod_{t=1}^{s} \frac{\mu(\sigma_t \mid \sigma_{<t})}{M(\sigma_t \mid \sigma_{<t})}
\;-\;
\sum_{\sigma_{1:s}} \mu(\sigma_{1:s}) \ln \prod_{t=1}^{s} \frac{\mu(\sigma_t \mid \sigma_{<t})}{\rho(\sigma_t \mid \sigma_{<t})}
\;=\;
\sum_{\sigma_{1:s}} \mu(\sigma_{1:s}) \ln \frac{\rho(\sigma_{1:s})}{M(\sigma_{1:s})}.
\]

Unravelling steps (f) to (i), using the dominance of M over ρ ∈ M within constant w(ρ), then leads to the bound − ln w(ρ).

This shows that the universal predictor almost always quickly predicts as well as any competing predictor, no matter the actual (possibly not even semicomputable) environment. The fact that ρ can be any semicomputable semimeasure corroborates our extension of all possible predictors to the complete class M.
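The bound can be checked numerically on a toy mixture (the predictors, prior weights, environment and horizon below are all invented for the purpose; the genuine ξ_w mixes over all of M rather than over three components):

```python
from itertools import product
from math import log

def bernoulli(theta):
    """Predictor on binary strings: independent bits, each a 1 with probability theta."""
    return lambda x: (theta ** x.count("1")) * ((1 - theta) ** x.count("0"))

predictors = {0.2: bernoulli(0.2), 0.5: bernoulli(0.5), 0.8: bernoulli(0.8)}
w = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}          # prior weights summing to 1

def mixture(x):
    return sum(w[k] * predictors[k](x) for k in predictors)

mu = bernoulli(0.63)    # the "actual" environment; it need not be in the class
s = 8                   # horizon
strings = ["".join(bits) for bits in product("01", repeat=s)]

def summed_kl(nu):
    """Sum over t <= s of D_t(mu || nu), computed via the chain rule over length-s strings."""
    return sum(mu(x) * log(mu(x) / nu(x)) for x in strings)

for k, rho in predictors.items():
    extra = summed_kl(mixture) - summed_kl(rho)
    print(f"rho = Bernoulli({k}): extra error {extra:.3f} <= bound {-log(w[k]):.3f}")
```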

4A similar result could be derived in terms of the squared Hellinger distance, but this requires a very elaborate proof [P. Grünwald, personal communication].


4.4.1.3 Solomonoff Prediction and the Problem of Induction

We can now address the question that is the title of [168]: Does Algorithmic Probability Solve the Problem of Induction? According to Li and Vitányi,

“R.J. Solomonoff’s inductive method (...) may give a rigorous and sat- isfactory solution to this old problem in philosophy.” [119, p. 347]

The problem Hume’s original challenge concerned the justification of induction. The argument5 is that inductive inferences are characteristically underdetermined by the premises: there are different conclusions, different ways of amplifying our knowledge, that are equally consistent with the premises. Hume suggests that the underlying principle we use to justify our inductive inferences is the uniformity of nature: the “course of nature continues always uniformly the same” [74, p. 89]. However, a defence of this principle, along the lines that it has always been this way, hence will be so in the future, is itself an inductive argument. In general, a justification of induction can be neither deductive (because of underdetermination) nor inductive (because of circularity): therefore no justification can be given [120]. Note that this point is strictly limited to the impossibility of giving an argument that justifies induction in the face of someone who doesn’t accept induction – a situation that is no different for the completely uncontroversial procedure of deduction! The impossibility of giving an argument that justifies induction doesn’t imply that induction is never sound [34, ch. 4]. Further assumptions or the restriction to specialized kinds of induction might also open the way for a partial justification. Nor does Hume’s conclusion bar a slightly more pragmatic vindication by deductively showing that a self-corrective method of repeated induction will bring us closer to the truth. Finally, even if one can’t prove that induction is reliable (always leading to the truth), there are ways of showing how certain inductive methods are optimal (the closest we can get to the truth) in some sense. Formal learning theory, that, as we have seen (page 65), classifies inductive methods according to their truth-convergence, exemplifies such a project.

The answer Solomonoff Prediction is not a solution of Hume’s problem of induction, that, anyway, “in the absence of further assumptions (...) is and should be insoluble” [180]. Note that the problem of induction is essentially the same as what I called the problem of prediction before. The problem is that not assuming anything simply means that anything can happen, in which case every specific prediction has the potential to fail. In the original interpretation of the model, we do assume something: that the actual environment is given by a measure that is effective. In directing SP against Hume, this starting assumption takes the role of the question-begging assumption of the uniformity of nature. However, the purpose of the previous three Sections was to show that it is a very mild assumption. And it has allowed us to prove the existence of reliable methods of prediction: the universal predictors quickly converge to the truth. These two facts are conveyed by the statement that in a universal setting, there are universal predictors. In a setting that is universal in the sense that only very little is assumed, we can derive the existence of methods of prediction that are universal in the sense that they are (almost) always reliable. If it were the case that the starting assumption is in fact not a restriction at all, then, since it still gives us reliable methods of induction, one could perhaps claim that a solution to Hume’s problem is in sight. But the current Section is motivated by the acknowledgement that the assumption of effectiveness might in fact be a substantive restriction on what can happen in the world. In the new interpretation of the model, we do away with this assumption – and so with any reference to an underlying “truth”. This also prevents us from saying anything about the reliability of predictors in our model, blocking a full-blown justification of their use or the use of induction in general. What we now can say is that SP establishes the existence of predictors that are optimal. They signify the best we can ever do. In a setting that is universal in the sense that only very little is assumed, there are predictors that are universal in the sense that they are (almost) always optimal. As I will argue next, it is also plausible that in the new interpretation the starting assumption is not a restriction at all.

5In modern terminology – Hume doesn’t even use the word “induction” in his argument in the Treatise of Human Nature.

Theories of optimal prediction The existence of optimal predictors in a universal setting is also a point that may inspire more specialized theories of induction. One relevant approach that is inspired by Solomonoff is the theory of individual sequence prediction (or prediction with expert advice) [182, 14, 15]. Originally a theory of machine learning, it aims for prediction algorithms that minimize the worst possible prediction error by evaluating the performance of other predictors (“experts”). Characteristically, no stochastic assumptions regarding the data source are needed. Much related is G. Schurz’s project of meta-induction [152, 153]. A strategy of keeping an eye on the past track record of other prediction methods, and mimicking those that were most successful, results in a prediction method that can be shown to be optimal in a universal sense. Again, there is little that can be said about the method’s reliability (in some outlandish scenarios all predictors might be very bad), but it’s guaranteed to be the best we can do. The position that optimality is a satisfactory justification of meta-induction can be seen as an echo of Reichenbach’s [141, §91] best-alternative argument for induction. Indeed, it is argued by Schurz that this a priori justification of meta-induction leads to a non-circular justification of regular induction applied to objects or events.

4.4.2 Universality of the Model of Predictors

The main line of the current Chapter was to argue for the universality of the model of SP. Picking up this line again, our concern is the generality of the model when the measures are viewed as predictors instead of environments. Replacing environments by predictors and so dropping the assumption of an actual measure doesn’t affect the conclusions in Sections 4.1 and 4.2 on the generality of the information-theoretic language and the setting of probability measures. The data can be represented, at the desired level of abstraction, by a developing binary string; the predictors evaluate the possible future data by assigning probabilities to continuations of the string. The precise (effective) encoding is of no consequence since the universal predictors account for every such encoding. And just like probability measures should be sufficiently general to represent environments, so should semimeasures certainly be sufficiently general to represent predictors. What remains is the familiar constraint of effectiveness.

4.4.2.1 Effectiveness of Predictors

The previous Section’s discussion of the effectiveness of environments got stuck at the very much open question of the Turing-computability of probability values of random sources in nature. Transferring the restriction of effectiveness to the predictors, we find ourselves in a much better position.

Another tailor-made Church-Turing Thesis The claim about Turing-computability of processes in nature that was codified by the Church-Turing-Solomonoff Thesis turns into a more sober version about Turing-computability of our prediction methods. To be specific, the claim of interest is the following Modest version of the CTST.

Thesis 4.3 (Modest Church-Turing-Solomonoff Thesis). Every prediction method is given by a semicomputable semimeasure.

It’s now more natural to turn away from talk about data generation capabilities and directly address the semicomputability of the prediction methods as semimeasures, that is, the semicomputability of the probability values that our prediction methods yield.
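A minimal sketch of what “semicalculability” of a probability value asks for (the particular series below is arbitrary): the value itself is never output, but ever better lower bounds on it can be enumerated.

```python
# Toy illustration of lower semicomputability: increasing approximations from below.

def lower_approximations(sigma):
    """Yield an increasing sequence of lower bounds converging to some probability value."""
    total, term = 0.0, 2.0 ** -(len(sigma) + 1)
    while True:
        total += term          # each stage adds another enumerated contribution
        term /= 2
        yield total            # never overshoots the limiting value 2 ** -len(sigma)

approx = lower_approximations("010")
print([round(next(approx), 4) for _ in range(5)])   # 0.0625, 0.0938, 0.1094, ... towards 0.125
```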

Evaluation of the Modest Church-Turing-Solomonoff Thesis The claim that the operations of our methods of prediction must always be calculable almost looks self-evident. Contrary to the earlier thesis about processes in nature, we are considering the prediction methods we can design ourselves, and their description or implementation as useful methods basically involves as a minimal condition their being calculable. Indeed, we only require that the predictor’s probability values are “semicalculable”: they need only allow for the mere possibility of calculation of increasingly accurate approximations from below. Of course, the further step is to identify this intuitive (semi)computability with Turing-(semi)computability. However, strictly speaking, we would still need to cover in our thesis all computations that are at all physically possible. After all, we could, in principle, design prediction methods that harness such computations. This observation can’t take away the strong intuition that a thesis about our own methods is in fact a bit weaker than a thesis about physically possible computation. In order to give this intuition more basis, recall that the Physical CTT, that covers all physically possible calculation, still hinges on the scope of the intuitive concept of computation. We saw that in the extreme case, one can consider every process in nature a calculation. However, it seems much more fitting to try to demarcate an idea of computation that is closer to practice, a conception that involves requirements of repeatability and locality and the like. How to make this precise would take us too far here; the relevant thing is that it would result in a most narrow notion of calculability. We saw earlier that, in the absence of evidence for physical hypercomputation, the corresponding interpretation of the Physical CTT is downright plausible. In fact, as a limit of imposing reasonable restrictions on the notion of calculation, and also connecting to predictive practice, it’s not unreasonable to confine ourselves to calculations that can be performed on a digital computer. In that case, all that we require is the uncontroversial original CTT. Then the constraint of computability, let alone semicomputability, is in fact no restriction at all. The crucial point to note is that the previous argument would be highly suspect in the case of effectiveness of environments (as these involve real-world processes rather than our own calculations), while it is only natural in the case of predictors. Indeed, the assumption of effectiveness of predictors is in complete agreement with practice, while the assumption that environments are effective is in fact not: in statistical practice, more often than not the models under consideration have incomputable parameters [P. Grünwald, personal communication]. This renders the Modest CTST in the model of predictors much more convincing than the original CTST in the model of environments. The constraint of a minimal level of effectiveness – semicomputability – of predictors should hardly be a constraint at all. The epistemic setting of prediction methods is much more receptive to a restriction of effectiveness, the latter being itself very much an epistemic notion.

4.4.2.2 Conclusion

In addition to the universality of the language of bit strings and the perfectly general identification of predictors and semimeasures, we have just seen that effectiveness of predictors is also a completely reasonable and undemanding condition. We may conclude that the model can, in principle, capture all possible predictors in any given predictive problem. One could argue that all possible competing predictors, exhibiting all that we can ever do, is also the very most that can still be of interest in any given predictive problem. We would be interested in the best possible, informed predictor, representing the best we can do; the further happenstance of an actual environment, that might not be effective and so escapes even the informed predictor, is strictly irrelevant because it is fundamentally out of our reach. This is a defence of the value of optimality rather than reliability. The new interpretation of all elements of M as predictors has the benefit of being much more consistent than the former interpretation of the members of M as both environments (restricted to subclass Mmeas) and predictors (the universal elements in the subclass UM). Most importantly, the new interpretation has the benefit of assuming much less. By retracting any claims on true environments we may replace the strong version of the CTT that environments must be effective by the much more credible supposition that our own prediction methods must be effective. If we accept the argument for optimality rather than reliability, then this means that the model is even more universal, to the point of being absolutely unrestrictive; in which case this is the best way of looking at the model of Solomonoff Prediction.


Conclusion

The essence of Solomonoff’s theory of Prediction is that in a universal setting, there are universal predictors.

The mathematical framework

Concepts from information theory, measure theory and computability theory inspire a general setting of prediction, as well as the definition of all-encompassing predictors within this setting. Formally, all predictive environments are represented by (strictly computable) elements of the class M of semicomputable semimeasures on bit strings, and this class contains elements that are formally universal in the sense that they dominate all others. Subclasses of such universal elements are isolated by the two seemingly different definitions of the algorithmic probability Q and the universal mixture distribution ξ. In fact, these two definitions appear to give the exact same class of asymptotically equivalent semimeasures. Moreover, one can prove that elements of this class quickly converge to any computable measure. These facts motivate speaking of a single universal predictor M.
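For reference, the formal notions just summarized can be stated compactly. The following is a schematic recapitulation in the notation of Chapter 1 and the Symbol Index, with $(\mu_i)_i$ an effective enumeration of M, $w$ a weight function and $c_\mu$ the multiplicative constants of dominance; it summarizes one standard formulation and proves nothing new.
\[
\text{Dominance:}\quad \nu\in\mathcal{M} \text{ is universal iff } \forall \mu\in\mathcal{M}\;\exists c_\mu>0\;\forall\sigma:\;\; \nu(\sigma)\;\ge\; c_\mu\,\mu(\sigma).
\]
\[
\text{Universal mixture:}\quad \xi(\sigma)\;=\;\sum_i w(\mu_i)\,\mu_i(\sigma), \qquad \sum_i w(\mu_i)\le 1.
\]
\[
\text{Completeness:}\quad \text{for every computable measure } \mu:\quad \sum_{t=1}^{\infty} D_t \;\le\; \ln\frac{1}{c_\mu} \;<\;\infty,
\]
where $D_t$ is the $\mu$-expected Kullback-Leibler divergence between the true and the universal predictions at trial $t$ (measured in nats). Since the expected total divergence is finite, the per-trial divergences are $\mu$-almost surely summable and so converge to zero, which is the sense in which a universal predictor quickly converges to any computable environment.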

Interpretations

The question is what to make of this strictly formal framework. This question about its interpretation relates to the question about the purpose of Solomonoff Prediction. I distinguish three broad ways of looking at SP and its aims: as a method, as a model, and as an actual theory of prediction.

Solomonoff's results are often heralded as handing us an idealized yet universally objective method of prediction. I reject this view, on the ground that the bare fact of an idealized method has neither practical nor theoretical use. The model interpretation, too, must give way in light of the extreme amount of idealization involved: the framework is simply too abstract to gain us any useful insights by modelling predictive problems. This leaves the interpretation of SP as a theory, which I simply take to be a (collection of) statement(s) that provide a broader context for a model and/or a method. The aim of such an idealized theory must either be to provide a better understanding of related (philosophical) issues, or to serve as inspiration for more practically oriented theories.

Both of these aims are supported by the identification of fundamental underlying ideas, of core principles. Taking a lead from the heuristic ideas that are commonly associated with the theory, I presented three candidate core principles. These principles are, first, Completeness, expressing that the convergence results regarding the universal predictor are sufficient to show that the theory works; second, Simplicity, highlighting the universal predictor's supposed bias towards simple hypotheses; and, finally, Universality, with the connotation that everything is taken into consideration. My defence of the primacy of the principle of Universality goes hand in hand with my proposed interpretation of the theory of SP, as boiling down to the statement that in a universal setting, there are universal predictors.

The problem of prediction

The philosophical context for my interpretation of SP is the problem of prediction: in the absence of restricting assumptions, anything can happen. The more restrictions we enforce on the environment (at increasing risk of losing generality), the more we can conclude about the future. The best possible way out of the problem of prediction is to somehow introduce assumptions that are minimally restrictive, yet allow us to be maximally determinate about the future. It is natural to rephrase the latter half as having a method of prediction that always (quickly) produces predictions that are maximally close to the truth. If the environment is indeterministic, then this involves the true probabilities over all possibilities. The problem of prediction must be seen as underlying the statement that in a setting that is universal in the sense of being minimally restrictive, there are predictors that are universal in the sense of almost always converging to the true future.

In a universal setting...

The model of SP adheres to Universality because it is primarily aimed at preserving maximal generality. Every real-world data-generating predictive environment can be expressed as a probability measure on finite bit strings. The model's information-theoretic language resists, in my interpretation, the standard "grue"-style examples that are meant to demonstrate its inherent subjectivity. The model's probability measures are independent of any particular interpretation of probability. The fact that it can subsume any predictive problem makes the information-theoretic and probabilistic setting completely general.

The further constraint of effectiveness is also both intuitively and formally extremely mild. This makes the information-theoretic, probabilistic and effective setting extremely general. However, it is not enough to refer to the uncontroversial Church-Turing Thesis in order to establish that effectiveness is in fact no restriction at all. We need a stronger version, what I call the Church-Turing-Solomonoff Thesis, which makes a claim about the effectiveness of data-generating processes in the world. As things stand, this claim might be false. In that case the assumption of effectiveness is a genuine loss of generality.

An answer to this threat is to interpret the class of computable measures not as environments, but as predictors. In fact, we can naturally extend this to the more consistent identification of all possible predictors with all semicomputable semimeasures. Then the assumption of an actual environment that is given by an effective measure is replaced by the assumption of a most informed predictor that is given by an effective semimeasure. The assumption that our methods of prediction must be effective, codified in the Modest Church-Turing-Solomonoff Thesis, is so natural as to function as no restriction at all.
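The two theses that are being traded against each other here can be put side by side; these are my own compressed formulations of the roles they play in this chapter, not verbatim statements:
\[
\text{CTST:}\quad \text{the actual environment is effective, i.e. given by some computable measure } \mu\in\mathcal{M}^{\mathrm{meas}};
\]
\[
\text{Modest CTST:}\quad \text{every prediction method we can design is effective, i.e. given by some semicomputable semimeasure } \rho\in\mathcal{M}.
\]
The first is a substantive claim about processes in the world, which might well be false; the second is a claim about our own constructions, which the considerations above render all but trivial.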

... there are universal predictors

In this universal setting, there exist predictors that adhere to Universality. There exist predictors with the property of formal universality, or dominance, a property that translates into the informal universality property of taking every possibility into consideration. Moreover, this property gives rise to what is sometimes called completeness: rapid convergence to any actual distribution. These predictors are universal in the sense that they almost always (i.e., with probability 1) converge to true predictions, whatever the actual generating distribution. The completeness of the universal predictors is essential, and may be interpreted as showing that the universal prediction methods "work", that they are reliable. This is the content of the principle of Completeness. But, following the rejection of the method interpretation of SP, completeness by itself conveys no ideas that can further the aims of the idealized theory; and so Completeness should be rejected as a principle.

The rejection of the method interpretation also hands us a reply to what is often taken as the most serious problem of SP: the inherent subjectivity brought on by the choice of one particular member from the complete class of universal predictors. All universal predictors can easily be shown to be asymptotically equivalent, but this invariance falls short of equality. This indeterminacy within constant bounds is a recurring theme in the subject. It appears to be an inevitable component, because there seem to be no valid criteria to objectively prefer one universal predictor (universal machine, weight function) over the rest. But from the perspective of my theory interpretation, which only takes SP to state that there exists a class of universal predictors in the universal setting, there is no reason even to require the selection of one element from this class. The relevant thing is that this class exists; it makes no sense to worry about which predictor from this class should be selected if we cannot ever put it to use anyway.

The issue of indeterminacy returns in my criticism of the alleged simplicity bias in the universal predictors. The crucial step in establishing this bias, both in the definition of algorithmic probability and in that of the universal mixture distribution, is the identification of simplicity with description length. While this idea is intuitively convincing, the freedom both within and between the possible formal definitions of description length stretches this intuition to the point of breaking it. In my interpretation, indeterminacy with respect to universal predictors is not consequential; for an objective definition of simplicity, however, it is. In the end, this discussion reduces to the conceivably insoluble matter of how much of a difference is still a difference. But the fact remains that the role of simplicity in SP is certainly less obvious than generally assumed, and I take this as sufficient ground to think of Simplicity as at best subsidiary to the unproblematic and overarching principle of Universality.

If we interpret the elements of the class of semicomputable semimeasures as predictors, then the universal elements do not dominate all possible environments but all possible competing predictors. A refined convergence result then holds that the universal predictors quickly converge to the predictions of the best predictor – irrespective of the actual, possibly not even semicomputable, environment. In that case, rather than universally reliable, the universal predictors are universally optimal. We have lost all reference to the truth; but, conditional on the argument that optimality is all that we need because the predictors represent the best we can do, the replacement of the questionable assumption of effective truth by the much more plausible assumption of effectiveness of predictors gains us full universality.
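One elementary way to see what such an optimality claim can look like, at the level of dominance alone and without any appeal to the true environment, is the following log-loss comparison. It is a schematic illustration of the kind of result at issue, not a reproduction of the refined convergence result itself: if the universal predictor M dominates every ρ in M, so that M(σ) ≥ c_ρ ρ(σ) for all σ, then taking negative logarithms gives
\[
-\ln \mathbf{M}(\sigma)\;\le\;-\ln\rho(\sigma)\;+\;\ln\frac{1}{c_\rho} \qquad\text{for every finite string }\sigma.
\]
Since the cumulative logarithmic loss of a sequential predictor on σ is just the negative logarithm of the probability it assigns to σ, the universal predictor's cumulative log-loss exceeds that of any competing effective predictor, including the most informed one, by at most a constant that does not depend on the data. The comparison holds on every individual sequence, and so irrespective of the actual, possibly non-effective, environment.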

Final Remarks

My interpretation of Solomonoff Prediction is a modest one. It takes a step back from the common interpretation that SP provides us with an idealized yet universally objective "solution to everything" method of prediction. This interpretation is either wrong or quite useless. It is also harmful to the propagation of the theory. The implausibility of this interpretation gives critics a handle to dismiss the value of the project out of hand. It gives them a reason not to take the trouble of looking further into it. It leads them to the other extreme view, that Solomonoff's theory is really nothing but a mathematical curiosity.

My interpretation takes an intermediate position. Solomonoff Prediction is more than a mathematical curiosity, because it can be convincingly brought into the context of the philosophical problem of prediction. In this thesis, I have shown how this is done. I have argued that my interpretation is sound and best accentuates the potential value of the theory of SP.

I believe that SP is valuable. However, a complete argument for the value of SP must include an in-depth defence of how it clarifies related philosophical issues or how it inspires practical theories. That work was not done in this thesis: it was not its aim. Although I touched upon many of the ideas in the proximity of SP, from formal learning theory to meta-induction, I consequently steer clear of claims about the implications of this thesis for these projects. I remain silent, for instance, about the role of simplicity in MDL. Those are subjects for future investigations: I hope that the results of this thesis can provide a basis.


Symbol Index

2^ω, class of infinite bit strings, 12
2^<ω, set of finite bit strings, 12
=+, equality up to additive constant, 30
=×, equality up to multiplicative constant, 30
A′, bottom set of A, 26
C, plain Kolmogorov complexity, 25
C_M, plain Kolmogorov complexity relative to M, 24
D, Kullback-Leibler divergence, 40
D, collection of data, 7
D_t, expected Kullback-Leibler divergence of t-th prediction, 40
H, hypothesis, 7
H, squared Hellinger distance, 39
H_t, expected squared error of the t-th prediction, 39
K, (prefix) Kolmogorov complexity, 28
KM, negative logarithm of Q, 75
Km, monotone Kolmogorov complexity, 75
M, machine, 21
P_II, 26
P′_II, 24
P″_II, 25
P_IV, 31
P_I, 22
Q, algorithmic probability, 29
Q^prefix, prefix algorithmic probability, 76
Q^t, resource-bounded algorithmic probability, 49
Q_U, algorithmic probability relative to U, 29
Q_norm, normalized algorithmic probability, 88
R, inverse exponential monotone Kolmogorov complexity, 75
R_K, inverse exponential Kolmogorov complexity, 76
T^t_σ, resource-bounded set of minimal descriptions of σ, 49
T_σ, set of minimal descriptions of σ, 29
U, universal machine, 21
U^t, time-bounded universal machine, 49
Γ_σ, cylinder set of σ, 31
λ, uniform distribution, 64
M, universal predictor, 38
M, universal prior distribution, 38
m, discrete universal prior distribution, 76
G, collection of cylinder sets, 31
M, class of lower semicomputable semimeasures, 31
M^comp, class of computable semimeasures, 33
M^incomp, complement of class of computable semimeasures, 37
M^meas, class of computable measures, 32
M^semimeas, complement of class of computable measures, 37
U, class of universal elements of M, 33
U_Q, class of algorithmic probability distributions, 36
U_M, class of universal prior distributions, 38
U_ξ, class of universal mixture distributions, 33
c, Carnapian confirmation measure, 14
m, Carnapian probability measure, 14
µ, (semi)measure, 31
µ_M, semimeasure defined with M, 35
≺, strict initial segment of, 12
≼, initial segment of, 12
ρ, informed predictor, 98
σ, τ, ρ, finite bit string, 12
σ(n), n-th bit of σ, 12
σ*, finite continuation of σ, 12
σ | τ, incomparability of σ and τ, 12
στ, concatenation of σ and τ, 12
σ↾n, prefix of σ of length n, 12
τ_min, minimal description, 24
ξ, universal mixture distribution, 33
ξ^discr, discrete universal mixture distribution, 76
c_µ, 33
w, v, weight function, 33

