Lecture notes on descriptional complexity and randomness
Peter Gács Boston University [email protected]
A didactical survey of the foundations of Algorithmic Information Theory. These notes introduce some of the main techniques and concepts of the field. They have evolved over many years and have been used and referenced widely. “Version history” below gives some information on what has changed when.
Contents
1 Complexity
  1.1 Introduction
    1.1.1 Formal results
    1.1.2 Applications
    1.1.3 History of the problem
  1.2 Notation
  1.3 Kolmogorov complexity
    1.3.1 Invariance
    1.3.2 Simple quantitative estimates
  1.4 Simple properties of information
  1.5 Algorithmic properties of complexity
  1.6 The coding theorem
    1.6.1 Self-delimiting complexity
    1.6.2 Universal semimeasure
    1.6.3 Prefix codes
    1.6.4 The coding theorem for K(x)
    1.6.5 Algorithmic probability
  1.7 The statistics of description length

2 Randomness
  2.1 Uniform distribution
  2.2 Computable distributions
    2.2.1 Two kinds of test
    2.2.2 Randomness via complexity
    2.2.3 Conservation of randomness
  2.3 Infinite sequences
    2.3.1 Null sets
    2.3.2 Probability space
    2.3.3 Computability
    2.3.4 Integral
    2.3.5 Randomness tests
    2.3.6 Randomness and complexity
    2.3.7 Universal semimeasure, algorithmic probability
    2.3.8 Randomness via algorithmic probability

3 Information
  3.1 Information-theoretic relations
    3.1.1 The information-theoretic identity
    3.1.2 Information non-increase
  3.2 The complexity of decidable and enumerable sets
  3.3 The complexity of complexity
    3.3.1 Complexity is sometimes complex
    3.3.2 Complexity is rarely complex

4 Generalizations
  4.1 Continuous spaces, noncomputable measures
    4.1.1 Introduction
    4.1.2 Uniform tests
    4.1.3 Sequences
    4.1.4 Conservation of randomness
  4.2 Test for a class of measures
    4.2.1 From a uniform test
    4.2.2 Typicality and class tests
    4.2.3 Martin-Löf’s approach
  4.3 Neutral measure
  4.4 Monotonicity, quasi-convexity/concavity
  4.5 Algorithmic entropy
    4.5.1 Entropy
    4.5.2 Algorithmic entropy
    4.5.3 Addition theorem
    4.5.4 Information
  4.6 Randomness and complexity
    4.6.1 Discrete space
    4.6.2 Non-discrete spaces
    4.6.3 Infinite sequences
    4.6.4 Bernoulli tests
  4.7 Cells
    4.7.1 Partitions
    4.7.2 Computable probability spaces

5 Exercises and problems

A Background from mathematics
  A.1 Topology
    A.1.1 Topological spaces
    A.1.2 Continuity
    A.1.3 Semicontinuity
    A.1.4 Compactness
    A.1.5 Metric spaces
  A.2 Measures
    A.2.1 Set algebras
    A.2.2 Measures
    A.2.3 Integral
    A.2.4 Density
    A.2.5 Random transitions
    A.2.6 Probability measures over a metric space

B Constructivity
  B.1 Computable topology
    B.1.1 Representations
    B.1.2 Constructive topological space
    B.1.3 Computable functions
    B.1.4 Computable elements and sequences
    B.1.5 Semicomputability
    B.1.6 Effective compactness
    B.1.7 Computable metric space
  B.2 Constructive measure theory
    B.2.1 Space of measures
    B.2.2 Computable and semicomputable measures
    B.2.3 Random transitions

Bibliography
Version history
May 2021: changed the notation to the now widely accepted one: Kolmogorov's original complexity is denoted $C(x)$ (instead of the earlier $K(x)$), while the prefix complexity is denoted $K(x)$ instead of the earlier $H(x)$. Corrected a remark on the notation $a^o$.

June 2013: corrected a slight mistake in the section on randomness via algorithmic probability.

February 2010: chapters introduced, and recast using the memoir class.

April 2009: besides various corrections, a section is added on infinite sequences. This is not new material; it just places the most classical results on the randomness of infinite sequences before the more abstract theory.

January 2008: major rewrite.
• Added formal definitions throughout.
• Made corrections in the part on uniform tests and generalized complexity, based on remarks of Hoyrup, Rojas and Shen.
• Rearranged material.
• Incorporated material on uniform tests from the work of Hoyrup-Rojas.
• Added material on class tests.
1 Complexity
I have made this one longer only because I have not had the leisure to make it shorter.
Pascal
Thus, in the game of heads or tails, the occurrence of heads a hundred times in a row appears extraordinary to us, because the almost infinite number of combinations that can arise in a hundred tosses is divided into regular series, in which we see an easily grasped order reign, and into irregular series, which are incomparably more numerous.
Laplace
1.1 Introduction
The present section can be read as an independent survey on the problems of randomness. It serves as some motivation for the drier material to follow. If we toss a coin 100 times and it shows heads each time, we feel lifted to a land of wonders, like Rosencrantz and Guildenstern in [49]. The argument that 100 heads are just as probable as any other outcome convinces us only that the axioms of Probability Theory, as developed in [28], do not solve
all mysteries they are sometimes supposed to. We feel that the sequence consisting of 100 heads is not random, though others with the same probability are. Due to the somewhat philosophical character of this paradox, its history is marked by an amount of controversy unusual for ordinary mathematics. Before the reader hastens to propose a solution, let us consider a less trivial example, due to L. A. Levin.

Suppose that in some country, the share of votes for the ruling party in 30 consecutive elections formed a sequence $0.99x_i$ where for every even $i$, the number $x_i$ is the $i$-th digit of $\pi = 3.1415\dots$. Though many of us would feel that the election results were manipulated, it turns out to be surprisingly difficult to prove this by referring to some general principle.

In a sequence of $n$ fair elections, every sequence $\omega$ of $n$ digits has approximately the probability $Q_n(\omega) = 10^{-n}$ to appear as the actual sequence of third digits. Let us fix $n$. We are given a particular sequence $\omega$ and want to test the validity of the government's claim that the elections were fair. We interpret the assertion “$\omega$ is random with respect to $Q_n$” as a synonym for “there is no reason to challenge the claim that $\omega$ arose from the distribution $Q_n$”.

How can such a claim be challenged at all? The government, just like the weather forecaster who announces 30% chance of rain, does not guarantee any particular set of outcomes. However, to stand behind its claim, it must agree to any bet based on the announced distribution. Let us call a payoff function with respect to the distribution $P$ any nonnegative function $t(\omega)$ with $\sum_\omega P(\omega)t(\omega) \le 1$. If a “nonprofit” gambling casino asks 1 dollar for a game and claims that each outcome has probability $P(\omega)$ then it must agree to pay $t(\omega)$ dollars on outcome $\omega$. We would propose to the government the following payoff function $t_0$ with respect to $Q_n$: let $t_0(\omega) = 10^{n/2}$ for all sequences $\omega$ whose even digits are given by $\pi$, and $0$ otherwise. This bet would cost the government $10^{n/2} - 1$ dollars.

Unfortunately, we must propose the bet before the elections take place and it is unlikely that we would have come up exactly with the payoff function $t_0$. Is then the suspicion unjustifiable? No. Though the function $t_0$ is not so natural as to guess it in advance, it is still highly “regular”. And already Laplace assumed in [15] that the number of “regular” bets is so small that we can afford to make them all in advance and still win by a wide margin.

Kolmogorov discovered in [27] and [29] (probably without knowing about [15] but following a four decades long controversy on von Mises' concept of randomness, see [53]) that to make this approach work we must define “regular” or “simple” as “having a short description” (in some formal sense to
be specified below). There cannot be many objects having a short description because there are not many short strings of symbols to be used as descriptions.

We thus come to the principle saying that on a random outcome, all sufficiently simple payoff functions must take small values. It turns out below that this can be replaced by the more elegant principle saying that a random outcome itself is not too simple. If descriptions are written in a 2-letter alphabet then a typical sequence of $n$ digits takes $n \log 10$ letters to describe (if not stated otherwise, all logarithms in these notes are to the base 2). The digits of $\pi$ can be generated by an algorithm whose description takes up only some constant length. Therefore the sequence $x_1 \dots x_n$ above can be described with approximately $(n/2)\log 10$ letters, since every other digit comes from $\pi$. It is thus significantly simpler than a typical sequence and can be declared nonrandom.
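Numerically, the comparison in the previous paragraph looks as follows. This is a minimal illustrative sketch, not from the text; in particular `pi_const` is an assumed (not derived) constant standing for the length of a program that prints digits of $\pi$:

```python
from math import ceil, log2

n = 30                     # number of elections / digits
bits_per_digit = log2(10)  # about 3.32 bits to spell out one decimal digit

# Spelling out all n digits, versus spelling out only every other digit
# and generating the rest by a constant-size program for pi.
pi_const = 20              # assumed: bits for a program printing digits of pi
typical = n * bits_per_digit
manipulated = (n / 2) * bits_per_digit + pi_const

print(f"typical sequence:      about {ceil(typical)} bits")
print(f"half-from-pi sequence: about {ceil(manipulated)} bits")
```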
1.1.1 Formal results

The theory of randomness is more impressive for infinite sequences than for finite ones, since a sharp distinction can be made between random and nonrandom infinite sequences. For technical simplicity, first we will confine ourselves to finite sequences, more precisely to a discrete sample space Ω, which we identify with the set of natural numbers. Binary strings will be identified with the natural numbers they denote.

Definition 1.1.1 A Turing machine is an imaginary computing device consisting of the following. A control state belonging to a finite set 퐴 of possible control states. A fixed number of infinite (or infinitely extendable) strings of cells called tapes. Each cell contains a symbol belonging to a finite tape alphabet 퐵. On each tape, there is a read-write head observing one of the tape cells. The machine's configuration (global state) is determined at each instant by giving its control state, the contents of the tapes and the positions of the read-write heads. The “hardware program” of the machine determines its configuration in the next step as a function of the control state and the tape symbols observed. It can change the control state, the content of the observed cells and the position of the read-write heads (the latter by one step to the left or right). Turing machines can emulate computers of any known design (see for example [60]). The tapes are used for storing programs, input data, output and as memory. y

The Turing machines that we will use to interpret descriptions will be somewhat restricted.

Definition 1.1.2 Consider a Turing machine 퐹 which from a binary string 푝
and a natural number 푥 computes the output 퐹( 푝, 푥) (if anything at all). We will say that 퐹 interprets 푝 as a description of 퐹( 푝, 푥) in the presence of the side information 푥. We suppose that if 퐹( 푝, 푥) is defined then 퐹(푞, 푥) is not defined for any prefix 푞 of 푝. Such machines are called self-delimiting (s.d.). The conditional complexity 퐾퐹 (푥 | 푦) of the number 푥 with respect to the number 푦 is the length of the shortest description 푝 for which 퐹( 푝, 푦) = 푥. y
Kolmogorov and Solomonoff observed that the function 퐾퐹 (푥 | 푦) depends only weakly on the machine 퐹, because there are universal Turing machines capable of simulating the work of any other Turing machine whose description is supplied. More formally, the following theorem holds:

Theorem 1.1.1 (Invariance Theorem) There is a s.d. Turing machine 푇 such that for any s.d. machine 퐹 a constant 푐퐹 exists such that for all 푥, 푦 we have 퐾푇 (푥 | 푦) ≤ 퐾퐹 (푥 | 푦) + 푐퐹.

This theorem motivates the following definition.
Definition 1.1.3 Let us fix 푇 and define 퐾(푥 | 푦) = 퐾푇 (푥 | 푦) and 퐾(푥) = 퐾(푥 | 0). y

The function 퐾(푥 | 푦) is not computable. We can compute a nonincreasing, convergent sequence of approximations to 퐾(푥) (it is semicomputable from above), but will not know how far to go in this sequence for some prescribed accuracy.

If 푥 is a binary string of length 푛 then 퐾(푥) ≤ 푛 + 2 log 푛 + 푐 for some constant 푐. The description of 푥 with this length gives 푥 bit-for-bit, along with some information of length 2 log 푛 which helps the s.d. machine find the end of the description. For most binary strings of length 푛, no significantly shorter description exists, since the number of short descriptions is small. Below, this example is generalized and sharpened for the case when instead of counting the number of simple sequences, we measure their probability.

Definition 1.1.4 Denote by 푥∗ the first one among the shortest descriptions of 푥. y

The correspondence 푥 → 푥∗ is a code in which no codeword is the prefix of another one. This implies, by an argument well-known in Information Theory, the inequality
$$\sum_x 2^{-K(x \mid y)} \le 1, \qquad (1.1.1)$$
hence only a few objects 푥 can have small complexity. The converse of the same argument goes as follows. Let 휇 be a computable probability distribution, one for
which there is a binary program computing 휇(휔) for each 휔 to any given degree of accuracy. Let 퐾(휇) be the length of the shortest one of these programs. Then
$$K(\omega) \le -\log \mu(\omega) + K(\mu) + c. \qquad (1.1.2)$$
Here $c$ is a universal constant. These two inequalities are the key to the estimates of complexity and the characterization of randomness by complexity.

Denote $d_\mu(\omega) = -\log \mu(\omega) - K(\omega)$. Inequality (1.1.1) implies that $t_\mu(\omega) = 2^{d_\mu(\omega)}$ can be viewed as a payoff function. Now we are in a position to solve the election paradox. We propose the payoff function
$$2^{-\log Q_n(\omega) - K(\omega \mid n)}$$
to the government. (We used the conditional complexity $K(\omega \mid n)$ because the uniform distribution $Q_n$ depends on $n$.) If every other digit of the outcome $x$ comes from $\pi$ then $K(x \mid n) \le (n/2)\log 10 + c_0$, hence we win a sum $t(x) \ge c_1 10^{n/2}$ from the government (for some constants $c_0, c_1 > 0$), even though the bet does not contain any reference to the number $\pi$. The fact that $t_\mu(\omega)$ is a payoff function implies, by Markov's Inequality, for any $k > 0$
$$\mu\{\omega : K(\omega) < -\log \mu(\omega) - k\} < 2^{-k}. \qquad (1.1.3)$$
Inequalities (1.1.2) and (1.1.3) say that with large probability, the complexity $K(\omega)$ of a random outcome $\omega$ is close to its upper bound $-\log \mu(\omega) + K(\mu)$. This law occupies a distinguished place among the “laws of probability”, because if the outcome $\omega$ violates any such law, the complexity falls far below the upper bound. Indeed, a proof of some “law of probability” (like the law of large numbers, the law of the iterated logarithm, etc.) always gives rise to some simple computable payoff function $t(\omega)$ taking large values on the outcomes violating the law, just as in the election example. Let $m$ be some large number, suppose that $t(\omega)$ has complexity $< m/2$, and that $t(\omega_0) > 2^m$. Then inequality (1.1.2) can be applied to $\nu(\omega) = \mu(\omega)t(\omega)$, and we get
$$K(\omega) \le -\log \mu(\omega) - m + K(\nu) + c_0 \le -\log \mu(\omega) - m/2 + K(\mu) + c_1$$
for some constants $c_0, c_1$.
More generally, the payoff function 푡휇 (휔) is maximal (up to a multiplicative constant) among all payoff functions that are semicomputable (from below). Hence the quantity − log 휇(휔) − 퐾(휔) is a universal test of randomness. Its value measures the deficiency of randomness in the outcome 휔 with respect to the distribution 휇, or the extent of justified suspicion against the hypothesis 휇 given the outcome 휔.
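The betting mechanism is easy to simulate. The following is a minimal sketch, not taken from the text: the toy sample space, the payoff in the spirit of $t_0$, and the stand-in constant `PI_DIGITS` are illustrative assumptions (with 0-based digit positions). It checks the normalization $\sum_\omega P(\omega)t(\omega) \le 1$ and the Markov-type bound that underlies (1.1.3):

```python
n = 4                      # a toy version with 4-digit outcomes
outcomes = range(10**n)
P = 1.0 / 10**n            # the uniform distribution Q_n

PI_DIGITS = "3141"         # assumed stand-in for the digits of pi in this toy

def t(omega):
    """A payoff function in the spirit of t_0: it pays 10^(n/2) exactly when
    every even-position digit matches the corresponding digit of pi."""
    s = str(omega).zfill(n)
    ok = all(s[i] == PI_DIGITS[i] for i in range(0, n, 2))
    return 10.0 ** (n // 2) if ok else 0.0

# Normalization: sum_omega P(omega) t(omega) <= 1, so the casino breaks even.
print("E t =", sum(P * t(w) for w in outcomes))

# Markov's inequality, the mechanism behind (1.1.3): P{t >= 2^k} <= 2^-k.
for k in range(1, 8):
    mass = sum(P for w in outcomes if t(w) >= 2**k)
    print(f"P(t >= 2^{k}) = {mass:.4f} <= {2.0**-k:.4f}")
```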
1.1.2 Applications Algorithmic Information Theory (AIT) justifies the intuition of random sequences as nonstandard analysis justifies infinitely small quantities. Any statement of classical probability theory is provable without the notion of randomness, but some of them are easier to find using this notion. Due to the incomputability of the universal randomness test, only its approximations can be used in practice. Pseudorandom sequences are sequences generated by some algorithm, with some randomness properties with respect to the coin-tossing distribution. They have very low complexity (depending on the strength of the tests they withstand, see for example [14]), hence are not random. Useful pseudorandom sequences can be defined using the notion of computational complexity, for example the number of steps needed by a Turing machine to compute a certain function. The existence of such sequences can be proved using some difficult unproven (but plausible) assumptions of computation theory. See [6], [59], [24], [34].
Inductive inference

The incomputable “distribution” $\mathbf{m}(\omega) = 2^{-K(\omega)}$ has the remarkable property that the test $d(\omega \mid \mathbf{m})$ shows all outcomes $\omega$ “random” with respect to it. Relations (1.1.2) and (1.1.3) can be read as saying that if the real distribution is $\mu$ then $\mu(\omega)$ and $\mathbf{m}(\omega)$ are close to each other with large probability. Therefore if we know that $\omega$ comes from some unknown simple distribution $\mu$ then we can use $\mathbf{m}(\omega)$ as an estimate of $\mu(\omega)$. This suggests calling $\mathbf{m}$ the “apriori probability” (but we will not use this term much). The randomness test $d_\mu(\omega)$ can be interpreted in the framework of hypothesis testing: it is the likelihood ratio between the hypothesis $\mu$ and the fixed alternative hypothesis $\mathbf{m}$.

In ordinary statistical hypothesis testing, some properties of the unknown distribution $\mu$ are taken for granted. The sample $\omega$ is generally a large independent sample: $\mu$ is supposed to be a product distribution. Under these conditions, the universal test could probably be reduced to some of the tests used in statistical practice. However, these conditions do not hold in other applications: for
example testing for independence in a proposed random sequence, predicting some time series of economics, or pattern recognition.

If the apriori probability $\mathbf{m}$ is a good estimate of the actual probability then we can use the conditional apriori probability for prediction, without reference to the unknown distribution $\mu$. For this purpose, one first has to define apriori probability for the set of infinite sequences of natural numbers, as done in [61]. We denote this function by $M$. For any finite sequence $x$, the number $M(x)$ is the apriori probability that the outcome is some extension of $x$. Let $x, y$ be finite sequences. The formula
$$\frac{M(xy)}{M(x)} \qquad (1.1.4)$$
is an estimate of the conditional probability that the next terms of the outcome will be given by $y$ provided that the first terms are given by $x$. It converges to the actual conditional probability $\mu(xy)/\mu(x)$ with $\mu$-probability 1 for any computable distribution $\mu$ (see for example [46]).

To understand the surprising generality of the formula (1.1.4), suppose that some infinite sequence $z(i)$ is given in the following way. Its even terms are the subsequent digits of $\pi$, its odd terms are uniformly distributed, independently drawn random digits. Let $z(1:n) = z(1)\cdots z(n)$. Then $M(z(1:2i)a)/M(z(1:2i))$ converges to $0.1$ for $a = 0, \dots, 9$, while $M(z(1:2i+1)a)/M(z(1:2i+1))$ converges to 1 if $a$ is the $i$-th digit of $\pi$, and to 0 otherwise.

The inductive inference formula using conditional apriori probability can be viewed as a mathematical form of “Occam's Razor”: the advice to predict by the simplest rule fitting the data. It can also be viewed as a realization of Bayes' Rule, with a universally applicable apriori distribution. Since the distribution $M$ is incomputable, we view the main open problem of inductive inference as finding maximally efficient approximations to it. Sometimes, even a simple approximation gives nontrivial results (see [3]).
Information theory

Since with large probability, $K(\omega)$ is close to $-\log \mu(\omega)$, the entropy $-\sum_\omega \mu(\omega) \log \mu(\omega)$ of the distribution $\mu$ is close to the average complexity $\sum_\omega \mu(\omega) K(\omega)$. The complexity $K(x)$ of an object $x$ can indeed be interpreted as the distribution-free definition of information content. The correspondence $x \mapsto x^*$ is a sort of universal code: its average (even individual) “rate”, or codeword length, is almost equally good for any simple computable distribution.
It is of great conceptual help to students of statistical physics that entropy can be defined now not only for ensembles (probability distributions), but for individual states of a physical system. The notion of an individual incompressible sequence, a sequence whose complexity is maximal, also proved extremely useful in finding information-theoretic lower bounds on the computing speed of certain Turing machines (see [39]).

The conditional complexity $K(x \mid y)$ obeys identities analogous to the information-theoretical identities for conditional entropy, but these identities are less trivial to prove in AIT. The information $I(x : y) = K(x) + K(y) - K(x, y)$ has several interpretations. It is known that $I(x : y)$ is equal, to within an additive constant, to $K(y) - K(y \mid x^*)$, the amount by which the object $x^*$ (as defined in the previous section) decreases our uncertainty about $y$. But it can also be written as $-\log \mathbf{m}^2(x, y) - K(x, y) = d((x, y) \mid \mathbf{m}^2)$ where $\mathbf{m}^2 = \mathbf{m} \times \mathbf{m}$ is the product distribution of $\mathbf{m}$ with itself. It is thus the deficiency of randomness of the pair $(x, y)$ with respect to this product distribution. Since any object is random with respect to $\mathbf{m}$, we can view the randomness of the pair $(x, y)$ with respect to the product $\mathbf{m}^2$ as the independence of $x$ and $y$ from each other. Thus “information” measures the “deficiency of independence”.
Logic

Some theorems of Mathematical Logic (in particular, Gödel's theorem) have a strong quantitative form in AIT, with new philosophical implications (see [8], [9], [31], [33]). Levin based a new system of intuitionistic analysis on his Independence Principle (see below) in [33].
1.1.3 History of the problem

P. S. Laplace thought that the number of “regular” sequences (whatever “regular” means) is much smaller than the number of irregular ones (see [15]). In the first attempt at formalization a hundred years later, R. von Mises defined an infinite binary sequence as random (a “Kollektiv”) if the relative frequencies converge in any subsequence selected according to some (non-anticipating) “rule” (whatever “rule” means, see [53]). As pointed out by A. Wald and others, Mises's definitions are sound only if a countable set of possible rules is fixed. The logician A. Church, in accordance with his famous thesis, proposed to understand “rule” here as “recursive (computable) function”.

The Mises selection rules can be considered as special randomness tests. In the ingenious work [52], J. Ville proved that they do not capture all relevant
properties of random sequences. In particular, a Kollektiv can violate the law of the iterated logarithm. He proposed to consider arbitrary payoff functions (a countable set of them), as defined on the set of infinite sequences—these are more commonly known as martingales.

For the solution of the problem of inductive inference, R. Solomonoff introduced complexity and apriori probability in [45] and proved the Invariance Theorem. A. N. Kolmogorov independently introduced complexity as a measure of individual information content and randomness, and proved the Invariance Theorem (see [27] and [29]). The incomputability properties of complexity have noteworthy philosophical implications (see [4], [8], [9]).

P. Martin-Löf defined in [38] randomness for infinite sequences. His concept is essentially the synthesis of Ville and Church (as noticed in [41]). He recognized the existence of a universal test, and pointed out the close connection between the randomness of an infinite sequence and the complexity of its initial segments.

L. A. Levin defined the apriori probability $M$ as a maximal (to within a multiplicative constant) semicomputable measure. With the help of a modified complexity definition, he gave a simple and general characterization of random sequences by the behavior of the complexity of their initial segments (see [61], [30]). In [17] and [31], the information-theoretical properties of the self-delimiting complexity (as defined above) are exactly described. See also [10], [42] and [58].

In [33], Levin defined a deficiency of randomness $d(\omega \mid \mu)$ in a uniform manner for all (computable or incomputable) measures $\mu$. He proved that all outcomes are random with respect to the apriori probability $M$. In this and earlier papers, he also proved the Law of Information Conservation, stating that the information $I(\alpha; \beta)$ in a sequence $\alpha$ about a sequence $\beta$ cannot be significantly increased by any algorithmic processing of $\alpha$ (even using random number generators). He derived this law from a so-called Law of Randomness Conservation via the definition of information $I(\alpha; \beta)$ as deficiency of randomness with respect to the product distribution $M^2$.

Levin suggested the Independence Principle saying that any sequence $\alpha$ arising in nature contains only finite information $I(\alpha; \beta)$ about any sequence $\beta$ defined by mathematical means. With this principle, he showed that the use of more powerful notions of definability in randomness tests (or the notion of complexity) does not lead to fewer random sequences among those arising in nature.

The monograph [16] is very useful as background information on the various attempts in this century at solving the paradoxes of probability theory. The
work [33] is quite comprehensive but very dense; it can be recommended only to devoted readers. The work [61] is comprehensive and readable but not quite up-to-date. The surveys [12], [41] and [43] can be used to complement it. The most up-to-date and complete survey, which subsumes most of these notes, is [36].

AIT created many interesting problems of its own. See for example [10], [11], [17], [18], [19], [20], [35], [33], [37], [41], [44], [47], and the technically difficult results [48] and [55].
Acknowledgment

The form of exposition of the results in these notes and the general point of view represented in them were greatly influenced by Leonid Levin. More recent communication with Paul Vitányi, Mathieu Hoyrup, Cristóbal Rojas and Alexander Shen has also been very important.
1.2 Notation
When not stated otherwise, log means base 2 logarithm. The cardinality of a set $A$ will be denoted by $|A|$. (Occasionally there will be inconsistency, sometimes denoting it by $|A|$, sometimes by $\#A$.) If $A$ is a set then $1_A(x)$ is its indicator function:
$$1_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases}$$
The empty string is denoted by $\Lambda$. The set $A^*$ is the set of all finite strings of elements of $A$, including the empty string. Let $l(x)$ denote the length of string $x$. (Occasionally there will be inconsistency, sometimes denoting it by $|x|$, sometimes by $l(x)$.) For sequences $x$ and $y$, let $x \sqsubseteq y$ denote that $x$ is a prefix of $y$. For a string $x$ and a (finite or infinite) sequence $y$, we denote their concatenation by $xy$. For a sequence $x$ and $n \le l(x)$, the $n$-th element of $x$ is $x(n)$, and $x(i:j) = x(i)\cdots x(j)$. Sometimes we will also write
$$x^{\le n} = x(1:n).$$
The string $x_1 \cdots x_n$ will sometimes be written also as $(x_1, \dots, x_n)$. For a natural number $m$ let $\beta(m)$ be the binary sequence denoting $m$ in binary notation. We denote by $X^{\mathbb{N}}$ the set of all infinite sequences of elements of $X$.

The sets of natural numbers, integers, rational numbers, real numbers and complex numbers will be denoted respectively by $\mathbb{N}, \mathbb{Z}, \mathbb{Q}, \mathbb{R}, \mathbb{C}$. The set of nonnegative real numbers will be denoted by $\mathbb{R}_+$. The set of real numbers with $-\infty, \infty$ added (with the appropriate topology making it compact) will be denoted by $\overline{\mathbb{R}}$. Let
$$\mathbb{S}_r = \{0, \dots, r-1\}^*, \quad \mathbb{S} = \mathbb{N}^*, \quad \mathbb{B} = \{0, 1\}.$$
We use ∧ and ∨ to denote min and max, further
$$|x|^+ = x \vee 0, \quad |x|^- = |-x|^+$$
for real numbers $x$.

Let $\langle\cdot\rangle$ be some standard one-to-one encoding of $\mathbb{N}^*$ to $\mathbb{N}$, with partial inverses $[\cdot]_i$ where $[\langle x \rangle]_i = x(i)$ for $i \le l(x)$. For example, we can have
$$\langle i, j \rangle = \frac{1}{2}(i + j)(i + j + 1) + j, \quad \langle n_1, \dots, n_{k+1} \rangle = \langle\langle n_1, \dots, n_k \rangle, n_{k+1}\rangle.$$
We will use $\langle\cdot,\cdot\rangle$ in particular as a pairing function over $\mathbb{N}$. Similar pairing functions will be assumed over other domains, and they will also be denoted by $\langle\cdot,\cdot\rangle$.

Another use of the notation $(\cdot, \cdot)$ may arise when the usual notation $(x, y)$ of an ordered pair of real numbers could be confused with the same notation of an open interval. In this case, we will use $\langle x, y \rangle$ to denote the pair.

The relations
$$f \stackrel{+}{<} g, \quad f \stackrel{*}{<} g$$
mean inequality to within an additive constant and multiplicative constant respectively. The first is equivalent to $f \le g + O(1)$, the second to $f = O(g)$. The relation $f \stackrel{*}{=} g$ means $f \stackrel{*}{<} g$ and $f \stackrel{*}{>} g$.
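For concreteness, here is a small sketch of the Cantor-style pairing function written above, together with its inverse and the iterated encoding of tuples; the helper names are hypothetical:

```python
def pair(i: int, j: int) -> int:
    """Cantor pairing <i, j> = (i + j)(i + j + 1)/2 + j, a bijection N^2 -> N."""
    return (i + j) * (i + j + 1) // 2 + j

def unpair(n: int) -> tuple[int, int]:
    """Inverse of pair: recover (i, j) from n."""
    # Find the diagonal d = i + j: the largest d with d(d + 1)/2 <= n.
    d = 0
    while (d + 1) * (d + 2) // 2 <= n:
        d += 1
    j = n - d * (d + 1) // 2
    return d - j, j

def encode_tuple(*ns: int) -> int:
    """<n_1, ..., n_{k+1}> = <<n_1, ..., n_k>, n_{k+1}>, by iterated pairing."""
    code = ns[0]
    for m in ns[1:]:
        code = pair(code, m)
    return code

assert all(unpair(pair(i, j)) == (i, j) for i in range(50) for j in range(50))
print(encode_tuple(3, 1, 4))  # a single number encoding the triple
```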
1.3 Kolmogorov complexity
1.3.1 Invariance It is natural to try to define the information content of some text as the size of the smallest string (code) from which it can be reproduced by some decoder,
interpreter. We do not want too much information “to be hidden in the decoder”; we want it to be a “simple” function. Therefore, unless stated otherwise, we require that our interpreters be computable, that is partial recursive functions.
Definition 1.3.1 A partial recursive function 퐴 from S2 × S to S will be called a (binary) interpreter, or decoder. We will often refer to its first argument as the program, or code. y Partial recursive functions are relatively simple; on the other hand, the class of partial recursive functions has some convenient closure properties. Definition 1.3.2 For any binary interpreter 퐴 and strings 푥, 푦 ∈ S, the number
퐶퐴 (푥 | 푦) = min{푙( 푝) : 퐴( 푝, 푦) = 푥 } is called the conditional Kolmogorov-complexity of 푥 given 푦, with respect to the interpreter 퐴 . (If the set after the “min” is empty, then the minimum is ∞). We write 퐶퐴 (푥) = 퐶퐴 (푥 | Λ). y
The number 퐶퐴 (푥) measures the length of the shortest description for 푥 when the algorithm 퐴 is used to interpret the descriptions. The value 퐶퐴 (푥) depends, of course, on the underlying function 퐴. But, as Theorem 1.3.1 shows, if we restrict ourselves to sufficiently powerful inter- preters 퐴, then switching between them can change the complexity function only by amounts bounded by some additive constant. Therefore complexity can be considered an intrinsic characteristic of finite objects. Definition 1.3.3 A binary p.r. interpreter 푈 is called optimal if for any binary p.r. interpreter 퐴 there is a constant 푐 < ∞ such that for all 푥, 푦 we have
퐶푈 (푥 | 푦) ≤ 퐶퐴 (푥 | 푦) + 푐. (1.3.1)
y Theorem 1.3.1 (Invariance Theorem) There is an optimal p.r. binary interpreter.
Proof. The idea is to use interpreters that come from universal partial recursive functions. However, not all such functions can be used for universal interpreters. Let us introduce an appropriate pairing function. For 푥 ∈ B푛, let
$$x^o = x(1)\,0\,x(2)\,0 \dots x(n-1)\,0\,x(n)\,1,$$
where $x^o = x1$ for $l(x) = 1$. Any binary sequence $x$ having at least one 1 in an even position can be represented uniquely in the form $x = a^o b$.
We know there is a p.r. function $V : \mathbb{S}_2 \times \mathbb{S}_2 \times \mathbb{S} \to \mathbb{S}$ that is universal: for any p.r. binary interpreter $A$, there is a string $a$ such that for all $p, x$, we have $A(p, x) = V(a, p, x)$. Let us define the function $U(p, x)$ as follows. We represent the string $p$ in the form $p = u^o v$ and define $U(p, x) = V(u, v, x)$. Then $U$ is a p.r. interpreter. Let us verify that it is optimal. Let $A$ be a p.r. binary interpreter, $a$ a binary string such that $A(p, x) = V(a, p, x)$ for all $p, x$. Let $x, y$ be two strings. If $C_A(x \mid y) = \infty$, then (1.3.1) holds trivially. Otherwise, let $p$ be a binary string of length $C_A(x \mid y)$ with $A(p, y) = x$. Then we have
$$U(a^o p, y) = V(a, p, y) = A(p, y) = x,$$
and $C_U(x \mid y) \le 2l(a) + C_A(x \mid y)$.

In the above proof, the constant $2l(a)$ is a bound on the complexity of the description of the interpreter $A$ for the optimal interpreter $U$. Let us note that for any two optimal interpreters $U^{(1)}, U^{(2)}$, there is a constant $c$ such that for all $x, y$, we have
$$|C_{U^{(1)}}(x \mid y) - C_{U^{(2)}}(x \mid y)| < c. \qquad (1.3.2)$$
Hence the complexity 퐶푈 (푥) of description of an object 푥 does not depend strongly on the interpreter 푈. Still, for every string 푥, there is an optimal interpreter 푈 with 퐶푈 (푥) = 0. Imposing a universal bound on the table size of the Turing machines used to implement optimal interpreters, we can get a universal bound on the constants in (1.3.2). The theorem motivates the following definition. Definition 1.3.4 We fix an optimal binary p.r. interpreter 푈 and write 퐶(푥 | 푦) for 퐶푈 (푥 | 푦). y Theorem 1.3.1 (as well as other invariance theorems) is used in AIT for much more than just to show that 퐶퐴 (푥) is a proper concept. It is the principal tool to find upper bounds on 퐶(푥); this is why most such upper bounds are proved to hold only to within an additive constant. The optimal interpreter 푈( 푝, 푥) defined in the proof of Theorem 1.3.1 is obvi- ously a universal partial recursive function. Because of its convenient properties we will use it from now on as our standard universal p.r. function, and we will refer to an arbitrary p.r. function as 푈푝 (푥) = 푈( 푝, 푥).
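The decomposition $p = a^o b$ used in this proof is easy to make concrete. A minimal sketch with hypothetical helper names:

```python
def interleave(a: str) -> str:
    """a^o: each bit of a followed by 0, except the last, which is followed by 1."""
    assert a and set(a) <= {"0", "1"}
    return "".join(b + "0" for b in a[:-1]) + a[-1] + "1"

def split(p: str) -> tuple[str, str]:
    """Recover (a, b) from p = a^o b: read bits in pairs until a pair ends in 1."""
    a, i = [], 0
    while i + 1 < len(p):
        a.append(p[i])
        if p[i + 1] == "1":
            return "".join(a), p[i + 2:]
        i += 2
    raise ValueError("p does not begin with a block of the form a^o")

a, prog = "1011", "000111"
p = interleave(a) + prog   # U would run interpreter number a on program prog
assert split(p) == (a, prog)
print(split(p))
```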
Definition 1.3.5 We define $U_p(x_1, \dots, x_k) = U_p(\langle x_1, \dots, x_k \rangle)$. Similarly, we will refer to an arbitrary computably enumerable set as the range $W_p$ of some $U_p$. We often do not need the second argument of the function $U(p, x)$. We therefore define $U(p) = U(p, \Lambda)$. y

It is of no consequence that we chose binary strings as descriptions. It is rather easy to define (exercise), for any two natural numbers $r, s$, a standard encoding $\mathrm{cnv}_s^r$ of base $r$ strings $x$ into base $s$ strings with the property
$$l(\mathrm{cnv}_s^r(x)) \le \frac{\log r}{\log s}\, l(x) + 1.$$
Now, with 푟-ary strings as descriptions, we must define 퐶퐴 (푥) as the minimum of 푙( 푝) log 푟 over all programs 푝 in S푟 with 퐴( 푝) = 푥. The equivalence of the definitions for different values of 푟 is left as an exercise.
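To get a feel for Definition 1.3.2, here is a toy, entirely hypothetical interpreter together with a brute-force computation of $C_A$. It illustrates how $C_A$ depends on the chosen $A$, which is exactly the dependence that the Invariance Theorem smooths out up to an additive constant:

```python
from itertools import product

def A(p: str):
    """A toy interpreter (an assumption of this sketch, not the optimal one):
    '1x' describes the string x literally; '0' followed by the binary number n
    describes the string 0^n."""
    if p.startswith("1"):
        return p[1:]
    if p.startswith("0") and len(p) > 1:
        return "0" * int(p[1:], 2)
    return None

def C_A(x: str) -> int:
    """Brute-force C_A(x): the length of the shortest p with A(p) = x."""
    for length in range(0, len(x) + 2):   # the literal code guarantees a match
        for bits in product("01", repeat=length):
            if A("".join(bits)) == x:
                return length
    raise AssertionError("unreachable: the literal description always matches")

print(C_A("0" * 64))   # 8: '0' plus the 7-bit binary of 64 beats the 65-bit literal
print(C_A("10110100")) # 9: for an irregular string the literal code is all A offers
```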
1.3.2 Simple quantitative estimates

We found it meaningful to speak about the information content, or complexity, of a finite object. But how do we compute this quantity? It turns out that complexity is not a computable function, but much is known about its statistical behavior. The following notation will be useful in the future.

Definition 1.3.6 The relation $f \stackrel{+}{<} g$ means inequality to within an additive constant: there is a constant $c$ such that for all $x$, $f(x) \le g(x) + c$. We can write this also as $f \le g + O(1)$. We also say that $g$ additively dominates $f$. The relation $f \stackrel{+}{=} g$ means $f \stackrel{+}{<} g$ and $f \stackrel{+}{>} g$. The relation $f \stackrel{*}{<} g$ among nonnegative functions means inequality to within a multiplicative constant. It is equivalent to $\log f \stackrel{+}{<} \log g$ and $f = O(g)$. We also say that $g$ multiplicatively dominates $f$. The relation $f \stackrel{*}{=} g$ means $f \stackrel{*}{<} g$ and $f \stackrel{*}{>} g$. y

With the above notation, here are the simple properties.

Theorem 1.3.2 The following simple upper bound holds.

a) For any natural number $m$, we have
$$C(m) \stackrel{+}{<} \log m. \qquad (1.3.3)$$
b) For any positive real number $v$ and string $y$, every finite set $E$ of size $m$ has at least $m(1 - 2^{-v+1})$ elements $x$ with $C(x \mid y) \ge \log m - v$.
Corollary 1.3.7 $\lim_{n \to \infty} C(n) = \infty$.

Theorem 1.3.2 suggests that if there are so few strings of low complexity, then $C(n)$ converges fast to $\infty$. In fact this convergence is extremely slow, as shown in a later section.

Proof of Theorem 1.3.2. First we prove (a). Let the interpreter $A$ be defined such that if $\beta(m) = p$ then $A(p, y) = m$. Since $l(\beta(m)) \le \log m + 1$, we have $C(m) \stackrel{+}{<} C_A(m) \le \log m + 1$.

Part (b) says that the trivial estimate (1.3.3) is quite sharp for most numbers under $m$. The reason for this is that since there are only few short programs, there are only few objects of low complexity. For any string $y$, and any positive real number $u$, we have
$$|\{x : C(x \mid y) \le u\}| < 2^{u+1}. \qquad (1.3.4)$$
To see this, let $n = \lfloor u \rfloor$. The number $2^{n+1} - 1$ of different binary strings of length $\le n$ is an upper bound on the number of different shortest programs of length $\le u$. Now (b) follows immediately from (1.3.4).

The three properties of complexity contained in Theorems 1.3.1 and 1.3.2 are the ones responsible for the applicability of the complexity concept. Several important later results can be considered transformations, generalizations or analogues of these.

Let us elaborate on the upper bound (1.3.3). For any string $x \in \mathbb{S}_r$ of length $n$, we have
$$C(x) \stackrel{+}{<} n \log r + 2 \log r. \qquad (1.3.5)$$
In particular, for any binary string $x$, we have
$$C(x) \stackrel{+}{<} l(x).$$
Indeed, let the interpreter $A$ be defined such that if $p = \beta(r)^o\, \mathrm{cnv}_2^r(x)$ then $A(p, y) = x$. We have $C_A(x) \le 2l(\beta(r)) + n \log r + 1 \le (n + 2) \log r + 3$. Since $C(x) \stackrel{+}{<} C_A(x)$, we are done.

We will have many more upper bound proofs of this form, and will not always explicitly write out the reference to $C(x \mid y) \stackrel{+}{<} C_A(x \mid y)$ needed for the last step. Apart from the conversion, inequality (1.3.5) says that since the beginning of a program can command the optimal interpreter to copy the rest, the complexity of a binary sequence $x$ of length $n$ is not larger than $n$.

Inequality (1.3.5) contains the term $2 \log r$ instead of $\log r$ since we have to apply the encoding $w^o$ to the string $w = \beta(r)$; otherwise the interpreter cannot
detach it from the program $p$. We could use the code $\beta(|w|)^o w$, which is also a prefix code, since the first, detachable part of the codeword tells its length.

For binary strings $x$, natural numbers $n$ and real numbers $u \ge 1$ we define
$$J(u) = u + 2 \log u, \quad \iota(x) = \beta(|x|)^o x, \quad \iota(n) = \iota(\beta(n)).$$
We have $l(\iota(x)) \stackrel{+}{<} J(l(x))$ for binary sequences $x$, and $l(\iota(r)) \stackrel{+}{<} J(\log r)$ for numbers $r$. Therefore (1.3.5) is true with $J(\log r)$ in place of $2 \log r$. Of course, $J(x)$ could be replaced by still smaller functions, for example $x + J(\log x)$. We return to this topic later.
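A sketch of the prefix code $\iota$ (hypothetical helper names): the self-delimiting length header lets concatenated codewords be split without endmarkers, which is the mechanism behind the $J(\cdot)$ terms in the additivity bounds of the next section:

```python
def interleave(x: str) -> str:
    """x^o: each bit of x followed by 0, except the last, which is followed by 1."""
    assert x  # nonempty binary string
    return "".join(b + "0" for b in x[:-1]) + x[-1] + "1"

def beta(m: int) -> str:
    """beta(m): the binary notation of m."""
    return bin(m)[2:]

def iota(x: str) -> str:
    """iota(x) = beta(l(x))^o x, a prefix code of length about J(l(x))."""
    return interleave(beta(len(x))) + x

def decode_first(stream: str) -> tuple[str, str]:
    """Split off the first iota codeword; return (x, remainder)."""
    bits, i = [], 0
    while i + 1 < len(stream):       # read the self-delimiting header beta(l(x))^o
        bits.append(stream[i])
        if stream[i + 1] == "1":
            n = int("".join(bits), 2)
            return stream[i + 2 : i + 2 + n], stream[i + 2 + n :]
        i += 2
    raise ValueError("not an iota codeword")

x, y = "110", "01010"
stream = iota(x) + iota(y)           # concatenation needs no endmarkers
assert decode_first(stream) == (x, iota(y))
print(decode_first(stream))
```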
1.4 Simple properties of information
If we transform a string 푥 by applying to it a p.r. function, then we cannot gain information over the amount contained in 푥 plus the amount needed to describe the transformation. The string 푥 becomes easier to describe, but it helps less in describing other strings 푦.
Theorem 1.4.1 For any partial recursive function $U_q$ over strings, we have
$$C(U_q(x) \mid y) \stackrel{+}{<} C(x \mid y) + J(l(q)),$$
$$C(y \mid U_q(x)) \stackrel{+}{>} C(y \mid x) - J(l(q)).$$
Proof. To define an interpreter $A$ with $C_A(U_q(x) \mid y) \stackrel{+}{<} C(x \mid y) + J(l(q))$, we define $A(\iota(q)p, y) = U_q(U(p, y))$. To define an interpreter $B$ with $C_B(y \mid x) \stackrel{+}{<} C(y \mid U_q(x)) + J(l(q))$, we define $B(\iota(q)p, x) = U(p, U_q(x))$.

The following notation is natural.

Definition 1.4.1 The definition of conditional complexity is extended to pairs, etc. by
$$C(x_1, \dots, x_m \mid y_1, \dots, y_n) = C(\langle x_1, \dots, x_m \rangle \mid \langle y_1, \dots, y_n \rangle).$$
y With the new notation, here are some new results.
Corollary 1.4.2 For any one-to-one p.r. function $U_p$, we have
$$|C(x) - C(U_p(x))| \stackrel{+}{<} J(l(p)).$$
Further,
$$C(x \mid y, z) \stackrel{+}{<} C(x \mid U_p(y), z) + J(l(p)),$$
$$C(U_p(x)) \stackrel{+}{<} C(x) + J(l(p)),$$
$$C(x \mid z) \stackrel{+}{<} C(x, y \mid z), \qquad (1.4.1)$$
$$C(x \mid y, z) \stackrel{+}{<} C(x \mid y),$$
$$C(x, x) \stackrel{+}{=} C(x), \qquad (1.4.2)$$
$$C(x, y \mid z) \stackrel{+}{=} C(y, x \mid z),$$
$$C(x \mid y, z) \stackrel{+}{=} C(x \mid z, y),$$
$$C(x, y \mid x, z) \stackrel{+}{=} C(y \mid x, z),$$
$$C(x \mid x, z) \stackrel{+}{=} C(x \mid x) \stackrel{+}{=} 0.$$
We have made implicit use, several times, of a basic additivity property of complexity which makes it possible to estimate the joint complexity of a pair by the complexities of its constituents. As expressed in Theorem 1.4.2, it says essentially that to describe the pair of strings, it is enough to know the description of the first member and of a method to find the second member using our knowledge of the first one.

Theorem 1.4.2
$$C(x, y) \stackrel{+}{<} J(C(x)) + C(y \mid x).$$

Proof. We define an interpreter $A$ as follows. We decompose any binary string $p$ into the form $p = \iota(w)q$, and let $A(p, z) = U(q, U(w))$. Then $C_A(x, y)$ is bounded, to within an additive constant, by the right-hand side of the above inequality.
Corollary 1.4.3 For any p.r. function $U_p$ over pairs of strings, we have
$$C(U_p(x, y)) \stackrel{+}{<} J(C(x)) + C(y \mid x) + J(l(p)) \stackrel{+}{<} J(C(x)) + C(y) + J(l(p)),$$
and in particular,
$$C(x, y) \stackrel{+}{<} J(C(x)) + C(y).$$
This corollary implies the following continuity property of the function 퐶(푛): for any natural numbers 푛, ℎ, we have
$$|C(n + h) - C(n)| \stackrel{+}{<} J(\log h). \qquad (1.4.3)$$
Indeed, $n + h$ is a recursive function of $n$ and $h$. The term $2 \log C(x)$ making up the difference between $C(x)$ and $J(C(x))$ in Theorem 1.4.2 is attributable to the fact that minimal descriptions cannot be concatenated without losing an “endmarker”. It cannot be eliminated, since there is a constant $c$ such that for all $n$, there are binary strings $x, y$ of length $\le n$ with
$$C(x) + C(y) + \log n < C(x, y) + c.$$
Indeed, there are $n 2^n$ pairs of binary strings whose sum of lengths is $n$. Hence by Theorem 1.3.2 b), there will be a pair $(x, y)$ of binary strings whose sum of lengths is $n$, with $C(x, y) > n + \log n - 1$. For these strings, inequality (1.3.5) implies $C(x) + C(y) \stackrel{+}{<} l(x) + l(y)$. Hence $C(x) + C(y) + \log n \stackrel{+}{<} n + \log n < C(x, y) + 1$.

Regularities in a string will, in general, decrease its complexity radically. If the whole string $x$ of length $n$ is given by some rule, that is we have $x(k) = U_p(k)$ for some recursive function $U_p$, then
$$C(x) \stackrel{+}{<} C(n) + J(l(p)).$$
Indeed, let us define the p.r. function $V(q, k) = U_q(1) \dots U_q(k)$. Then $x = V(p, n)$ and the above estimate follows from Corollary 1.4.2.

For another example, let $x = y_1 y_1 y_2 y_2 \cdots y_n y_n$, and $y = y_1 y_2 \cdots y_n$. Then $C(x) \stackrel{+}{=} C(y)$ even though the string $x$ is twice as long. This follows from Corollary 1.4.2 since $x$ and $y$ can be obtained from each other by a simple recursive operation.

Not all “regularities” decrease the complexity of a string, only those which distinguish it from the mass of all strings, putting it into some class which is both small and algorithmically definable. For a binary string of length $n$, it is nothing unusual to have its number of 0-s between $n/2 - \sqrt{n}$ and $n/2$. Therefore such strings can have maximal or almost maximal complexity. If $x$ has $k$ zeroes then the inequality
$$C(x) \stackrel{+}{<} \log \binom{n}{k} + J(\log n) + J(\log k)$$
follows from Theorems 1.3.2 and 1.4.3.
Theorem 1.4.3 Let $E = W_p$ be an enumerable set of pairs of strings, enumerated with the help of the program $p$, for example as
$$W_p = \{U(p, x) : x \in \mathbb{N}\}. \qquad (1.4.4)$$
We define the section
$$E^a = \{x : \langle a, x \rangle \in E\}. \qquad (1.4.5)$$
Then for all $a$ and $x \in E^a$, we have
$$C(x \mid a) \stackrel{+}{<} \log |E^a| + J(l(p)).$$

Proof. We define an interpreter $A(q, b)$ as follows. We decompose $q$ into $q = \iota(p)\beta(t)$. From a standard enumeration $(a(k), x(k))$ $(k = 1, 2, \dots)$ of the set $W_p$, we produce an enumeration $x(k, b)$ $(k = 1, 2, \dots)$, without repetition, of the set $W_p^b$ for each $b$, and define $A(q, b) = x(t, b)$. For this interpreter $A$, we get $C_A(x \mid a) \stackrel{+}{<} J(l(p)) + \log |E^a|$.

It follows from Theorem 1.3.2 that whenever the set $E^a$ is finite, the estimate of Theorem 1.4.3 is sharp for most of its elements $x$.

The shortest descriptions of a string $x$ seem to carry some extra information in them, above the description of $x$. This is suggested by the following strange identities.

Theorem 1.4.4 We have $C(x, C(x)) \stackrel{+}{=} C(x)$.
Proof. The inequality $\stackrel{+}{>}$ follows from (1.4.1). To prove $\stackrel{+}{<}$, let $A(p) = \langle U(p), l(p) \rangle$. Then $C_A(x, C(x)) \le C(x)$.

More generally, we have the following inequality.

Theorem 1.4.5
$$C(y \mid x, i - C(y \mid x, i)) \stackrel{+}{<} C(y \mid x, i).$$
The proof is left as an exercise.

The above inequalities are in some sense “pathological”, and do not necessarily hold for all “reasonable” definitions of descriptional complexity.
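Returning to Theorem 1.4.3, the proof's idea of describing $x \in E^a$ by its index in an enumeration can be sketched as follows; the set $E$ here is an arbitrary illustrative choice, not from the text:

```python
from itertools import count, islice

def enumerate_section(pairs, a):
    """Enumerate, without repetition, the section E^a = {x : (a, x) in E}."""
    seen = set()
    for b, x in pairs:
        if b == a and x not in seen:
            seen.add(x)
            yield x

# E: a toy enumerable set of pairs, here (n mod 3, n*n mod 100) for n = 0, 1, ...
E = ((n % 3, n * n % 100) for n in count())

# To describe an element of E^a given a, it suffices to transmit its index t
# in this enumeration: about log |E^a| bits, as in Theorem 1.4.3.
section = list(islice(enumerate_section(E, a=1), 10))
t = section.index(16)       # 16 = 4*4, and 4 mod 3 == 1
print(t, section[t])        # the pair (program, index t) reconstructs x = 16
```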
1.5 Algorithmic properties of complexity
The function $C(x)$ is not computable, as will be shown in this section. However, it has a property closely related to computability. Let $\mathbb{Q}$ be the set of rational numbers.
Definition 1.5.1 (Computability) Let $f : \mathbb{S} \to \mathbb{R}$ be a function. It is computable if there is a recursive function $g(x, n)$ with rational values, and $|f(x) - g(x, n)| < 1/n$. y
{(푥, 푟) : 푥 ∈ S, 푟 ∈ Q, 푟 < 푓 (푥)} is recursively enumerable. A function 푓 is called upper semicomputable if − 푓 is lower semicomputable. y It is easy to show the following: • A function 푓 : S → R is lower semicomputable iff there exists a recursive function with rational values, (or, equivalently, a computable real function) 푔(푥, 푛) nondecreasing in 푛, with 푓 (푥) = lim푛→∞ 푔(푥, 푛). • A function 푓 : S → R is computable if it is both lower and upper semicom- putable. The notion of semicomputibility is naturally extendable over other discrete domains, like S × S. It is also useful to extend it to functions R → R. Let I denote the set of open rational intervals of R, that is
$$\mathcal{I} = \{(p, q) : p, q \in \mathbb{Q},\ p < q\}.$$
Definition 1.5.3 (Computability for real functions) Let $f : \mathbb{R} \to \mathbb{R}$ be a function. It is computable if there is a recursively enumerable set $\mathcal{F} \subseteq \mathcal{I}^2$ such that, denoting $\mathcal{F}_J = \{I : (I, J) \in \mathcal{F}\}$, we have
$$f^{-1}(J) = \bigcup_{I \in \mathcal{F}_J} I.$$
y
This says that if $f(x) \in J$ then sooner or later we will find an interval $I \in \mathcal{F}_J$ containing $x$ with the property that $f(z) \in J$ for all $z \in I$. Note that computability implies continuity.

Definition 1.5.4 We say that $f : \mathbb{R} \to \mathbb{R}$ is lower semicomputable if there is a recursively enumerable set $\mathcal{G} \subseteq \mathbb{Q} \times \mathcal{I}$ such that, denoting $\mathcal{G}_q = \{I : (I, q) \in \mathcal{G}\}$, we have
$$f^{-1}((q, \infty)) = \bigcup_{I \in \mathcal{G}_q} I.$$
y
This says that if $f(x) > q$ then sooner or later we will find an interval $I \in \mathcal{G}_q$ containing $x$ with the property that $f(z) > q$ for all $z \in I$.

The following facts are useful and simple to verify:

Proposition 1.5.5
a) The function $\lceil \cdot \rceil : \mathbb{R} \to \mathbb{R}$ is lower semicomputable.
b) If $f, g : \mathbb{R} \to \mathbb{R}$ are computable then their composition $f(g(\cdot))$ is, too.
c) If $f : \mathbb{R} \to \mathbb{R}$ is lower semicomputable and $g : \mathbb{R} \to \mathbb{R}$ (or $\mathbb{S} \to \mathbb{R}$) is computable then $f(g(\cdot))$ is lower semicomputable.
d) If $f : \mathbb{R} \to \mathbb{R}$ is lower semicomputable and monotonic, and $g : \mathbb{R} \to \mathbb{R}$ (or $\mathbb{S} \to \mathbb{R}$) is lower semicomputable then $f(g(\cdot))$ is lower semicomputable.

Definition 1.5.6 We introduce a universal lower semicomputable function. For every binary string $p$, and $x \in \mathbb{S}$, let
$$S_p(x) = \sup\{y/z : y, z \in \mathbb{Z},\ z \ne 0,\ \langle\langle x \rangle, y, z\rangle \in W_p\}$$
where $W_p$ was defined in (1.4.4). y
The function $S_p(x)$ is lower semicomputable as a function of the pair $(p, x)$, and for different values of $p$, it enumerates all semicomputable functions of $x$.
Definition 1.5.7 Let $S_p(x_1, \dots, x_k) = S_p(\langle x_1, \dots, x_k \rangle)$. For any lower semicomputable function $f$, we call any binary string $p$ with $S_p = f$ a Gödel number of $f$. There is no universal computable function, but for any computable function $g$, we call any number $\langle\langle p \rangle, \langle q \rangle\rangle$ a Gödel number of $g$ if $g = S_p = -S_q$. y

With this notion, we claim the following.

Theorem 1.5.1 The function $C(x \mid y)$ is upper semicomputable.

Proof. By Theorem 1.3.2 there is a constant $c$ such that for any strings $x, y$, we have $C(x \mid y) < \log\langle x \rangle + c$. Let some Turing machine compute our optimal binary p.r. interpreter. Let $U^t(p, y)$ be defined as $U(p, y)$ if this machine, when started on input $(p, y)$, gives an output in $t$ steps, and undefined otherwise. Let $C^t(x \mid y)$ be the smaller of $\log\langle x \rangle + c$ and
$$\min\{l(p) : l(p) \le t,\ U^t(p, y) = x\}.$$
Then the function $C^t(x \mid y)$ is computable, monotonically nonincreasing in $t$, and $\lim_{t \to \infty} C^t(x \mid y) = C(x \mid y)$.

The function $C(n)$ is not computable. Moreover, it has no nontrivial partial recursive lower bound.
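The proof of Theorem 1.5.1 can be turned into a small simulation. The interpreter below is a hypothetical stand-in for the optimal one (which no toy can contain); what the sketch illustrates is the shape of the approximation: computable, nonincreasing in $t$, converging to the limit without ever certifying convergence:

```python
from itertools import product

def all_strings(max_len):
    """All binary strings of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def toy_interpreter(p, t):
    """Hypothetical time-bounded run: '1x' -> x in one step;
    '0'+bin(n) -> 0^n, taking n steps, so it only halts within t steps if n <= t."""
    if p.startswith("1"):
        return p[1:]
    if p.startswith("0") and len(p) > 1:
        n = int(p[1:], 2)
        return "0" * n if n <= t else None
    return None

def approximations(x, interpreter, max_t):
    """Yield C^t(x) for t = 1..max_t: nonincreasing upper approximations,
    scanning programs of length <= t that halt within t steps."""
    best = len(x) + 1  # an a priori bound, playing the role of log<x> + c
    for t in range(1, max_t + 1):
        for p in all_strings(t):
            if interpreter(p, t) == x:
                best = min(best, len(p))
        yield best

print(list(approximations("0" * 8, toy_interpreter, max_t=10)))
# [9, 9, 9, 9, 9, 9, 9, 5, 5, 5] -- the limit 5 is reached at t = 8,
# but no algorithm can certify that the approximation has converged.
```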
Theorem 1.5.2 Let $U_p(n)$ be a partial recursive function whose values are numbers smaller than $C(n)$ whenever defined. Then
$$U_p(n) \stackrel{+}{<} J(l(p)).$$
Proof. The proof of this theorem resembles “Berry's paradox”, which says: “The least number that cannot be defined in less than 100 words” is a definition of that number in 12 words. The paradox can be used to prove, whenever we agree on a formal notion of “definition”, that the sentence in quotes is not a definition.

Let $x(p, k)$ for $k = 1, 2, \dots$ be an enumeration of the domain of $U_p$. For any natural number $m$ in the range of $U_p$, let $f(p, m)$ be equal to the first $x(p, k)$ with $U_p(x(p, k)) = m$. Then by definition, $m < C(f(p, m))$. On the other hand, $f(p, m)$ is a p.r. function, hence applying Corollary 1.4.3 and (1.3.5) we get
$$m < C(f(p, m)) \stackrel{+}{<} C(p) + J(C(m)) \stackrel{+}{<} l(p) + 1.5 \log m.$$
Church's first example of an undecidable recursively enumerable set used universal partial recursive functions and the diagonal technique of Gödel and Cantor, which is in close connection to Russell's paradox. The set he constructed is complete in the sense of “many-one reducibility”. The first examples of sets not complete in this sense were Post's simple sets.

Definition 1.5.8 A computably enumerable set is simple if its complement is infinite but does not contain any infinite computably enumerable subsets. y

The paradox used in the above proof is not equivalent in any trivial way to Russell's paradox, since the undecidable computably enumerable sets we get from Theorem 1.5.2 are simple. Let $f(n) \le \log n$ be any recursive function with $\lim_{n \to \infty} f(n) = \infty$. We show that the set $E = \{n : C(n) \le f(n)\}$ is simple. It follows from Theorem 1.5.1 that $E$ is recursively enumerable, and from Theorem 1.3.2 that its complement is infinite. Let $A$ be any computably enumerable set disjoint from $E$. The restriction $f_A(n)$ of the function $f(n)$ to $A$ is a p.r. lower bound for $C(n)$. It follows from Theorem 1.5.2 that $f_A$ is bounded. Since $f(n) \to \infty$, this is possible only if $A$ is finite.

Gödel used the diagonal technique to prove the incompleteness of any sufficiently rich logical theory with a recursively enumerable axiom system. His technique provides for any sufficiently rich computably enumerable theory a concrete example of an undecidable sentence. Theorem 1.5.2 leads to a new proof of this
result, with essentially different examples of undecidable propositions. Let us consider any first-order language $L$ containing the language of standard first-order arithmetic. Let $\langle\cdot\rangle$ be a standard encoding of the formulas of this theory into natural numbers. Let the computably enumerable theory $T_p$ be the theory for which the set of codes $\langle\Phi\rangle$ of its axioms $\Phi$ is the computably enumerable set $W_p$ of natural numbers.

Corollary 1.5.9 There is a constant $c < \infty$ such that for any $p$, if any sentences with the meaning “$m < C(n)$” for $m > J(l(p)) + c$ are provable in theory $T_p$ then some of these sentences are false.
Proof. For some $p$, suppose that all sentences “$m < C(n)$” provable in $T_p$ are true. For a sentence $\Phi$, let $T_p \vdash \Phi$ denote that $\Phi$ is a theorem of the theory $T_p$. Let us define the function
$$A(p, n) = \max\{m : T_p \vdash \text{“} m < C(n) \text{”}\}.$$
This function is semicomputable since the set $\{(p, \langle\Phi\rangle) : T_p \vdash \Phi\}$ is recursively enumerable. Therefore there is a binary string $a$ such that $A(p, n) = S(a, \langle p, n \rangle)$. It is easy to see that we have a constant string $b$ such that
$$S(a, \langle p, n \rangle) = S(q, n)$$
where $q = bap$. (Formally, this statement follows for example from the so-called $S_m^n$-theorem.) The function $S_q(n)$ is a lower bound on the complexity function $C(n)$. By an immediate generalization of Theorem 1.5.2 (for semicomputable functions), we have $S(q, n) \stackrel{+}{<} J(l(q)) \stackrel{+}{<} J(l(p))$.

We have seen that the complexity function $C(x \mid y)$ is not computable, and does not have any nontrivial recursive lower bounds. It has many interesting upper bounds, however, for example the ones in Theorems 1.3.2 and 1.4.3. The following theorem characterizes all upper semicomputable upper bounds (majorants) of $C(x \mid y)$.

Theorem 1.5.3 (Levin) Let $F(x, y)$ be a function of strings semicomputable from above. The relation $C(x \mid y) \stackrel{+}{<} F(x, y)$ holds for all $x, y$ if and only if we have
$$\log |\{x : F(x, y) < m\}| \stackrel{+}{<} m \qquad (1.5.1)$$
for all strings $y$ and natural numbers $m$.
Proof. Suppose that $C(x \mid y) \stackrel{+}{<} F(x, y)$ holds. Then (1.5.1) follows from part b) of Theorem 1.3.2.
Now suppose that (1.5.1) holds for all $y, m$. Then let $E$ be the computably enumerable set of triples $(x, y, m)$ with $F(x, y) < m$. It follows from (1.5.1) and Theorem 1.4.3 that for all $(x, y, m) \in E$ we have $C(x \mid y, m) \stackrel{+}{<} m$. Then by Theorem 1.4.5, $C(x \mid y) \stackrel{+}{<} m$.

The minimum of any finite number of majorants is again a majorant. Thus, we can combine different heuristics for recognizing patterns of sequences.
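In that spirit, here is a minimal sketch combining a few computable majorants by taking their minimum. The helper names are assumptions of this sketch, and the compressor-based bound is only an upper bound in spirit, up to the constant for the fixed decompressor:

```python
import zlib

def bound_literal(x: bytes) -> int:
    """Upper bound from the literal description: about l(x), in bits."""
    return 8 * len(x)

def bound_zip(x: bytes) -> int:
    """Upper bound from a fixed compressor: a compressed form is also a description."""
    return 8 * len(zlib.compress(x, 9))

def bound_run(x: bytes) -> int:
    """Upper bound for constant runs: describe (symbol, run length)."""
    if len(set(x)) == 1:
        return 8 + 2 * len(x).bit_length()
    return 10**9  # effectively infinity: this heuristic does not apply

def combined(x: bytes) -> int:
    # The minimum of finitely many majorants is again a majorant.
    return min(bound_literal(x), bound_zip(x), bound_run(x))

for x in [b"a" * 1000, bytes(range(256)), b"abracadabra" * 3]:
    print(len(x), "bytes ->", combined(x), "bits")
```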
1.6 The coding theorem
1.6.1 Self-delimiting complexity Some theorems on the addition of complexity do not have as simple a form as desirable. For example, Theorem 1.4.2 implies
$$C(x, y) \stackrel{+}{<} J(C(x)) + C(y).$$
We would like to see just $C(x)$ on the right-hand side, but the inequality $C(x, y) \stackrel{+}{<} C(x) + C(y)$ does not always hold (see Exercise 5). The problem is that if we compose a program for $(x, y)$ from those of $x$ and $y$ then we have to separate them from each other somehow. In this section, we introduce a variant of Kolmogorov's complexity discovered by Levin and independently by Chaitin and Schnorr, which has the property that “programs” are self-delimiting: this will free us of the necessity to separate them from each other.

Definition 1.6.1 A set of strings is called prefix-free if for any pair $x, y$ of elements in it, $x$ is not a prefix of $y$. A one-to-one function into a set of strings is a prefix code, or instantaneous code, if its domain of definition is prefix-free. An interpreter $f(p, x)$ is self-delimiting (s.d.) if for each $x$, the set $D_x$ of strings $p$ for which $f(p, x)$ is defined is a prefix-free set. y

Example 1.6.2 The mapping $p \mapsto p^o$ is a prefix code. y

A self-delimiting p.r. function $f(p)$ can be imagined as a function computable on a special self-delimiting Turing machine.

Definition 1.6.3 A Turing machine $\mathcal{T}$ is self-delimiting if it has no input tape, but can ask any time for a new input symbol. After some computation and a few such requests (if ever) the machine decides to write the output and stop. y

The essential difference between a self-delimiting machine $\mathcal{T}$ and an ordinary Turing machine is that $\mathcal{T}$ does not know in advance how many input symbols
suffice; she must compute this information from the input symbols themselves, without the help of an “endmarker”.

Requiring an interpreter $A(p, x)$ to be self-delimiting in $p$ has many advantages. First of all, when we concatenate descriptions, the interpreter will be able to separate them without endmarkers. Lost endmarkers were responsible for the additional logarithmic terms and the use of the function $J(n)$ in Theorem 1.4.2 and several other formulas for $C$.

Definition 1.6.4 A self-delimiting partial recursive interpreter $T$ is called optimal if for any other s.d. p.r. interpreter $F$, we have
$$C_T \stackrel{+}{<} C_F. \qquad (1.6.1)$$
y

Theorem 1.6.1 There is an optimal s.d. interpreter.

Proof. The proof of this theorem is similar to the proof of Theorem 1.3.1. We take the universal partial recursive function $V(a, p, x)$. We transform each function $V_a(p, x) = V(a, p, x)$ into a self-delimiting function $W_a(p, x) = W(a, p, x)$, but so as not to change the functions $V_a$ which are already self-delimiting. Then we form $T$ from $W$ just as we formed $U$ from $V$. It is easy to check that the function $T$ thus defined has the desired properties. Therefore we are done if we construct $W$.

Let some Turing machine $\mathcal{M}$ compute $y = V_a(p, x)$ in $t$ steps. We define $W_a(p, x) = y$ if $l(p) \le t$ and if $\mathcal{M}$ does not compute in $t$ or fewer steps any output $V_a(q, x)$ from any extension or prefix $q$ of length $\le t$ of $p$. If $V_a$ is self-delimiting then $W_a = V_a$. But $W_a$ is always self-delimiting. Indeed, suppose that $p_0 \sqsubseteq p_1$ and that $\mathcal{M}$ computes $V_a(p_i, x)$ in $t_i$ steps. The value $W_a(p_i, x)$ can be defined only if $l(p_i) \le t_i$. The definition of $W_a$ guarantees that if $t_0 \le t_1$ then $W_a(p_1, x)$ is undefined, otherwise $W_a(p_0, x)$ is undefined.

Definition 1.6.5 We fix an optimal self-delimiting p.r. interpreter $T(p, x)$ and write $K(x \mid y) = C_T(x \mid y)$. We call $K(x \mid y)$ the (self-delimiting) conditional complexity of $x$ with respect to $y$. y

We will use the expression “self-delimiting” only if confusion may arise. Otherwise, under complexity, we generally understand the self-delimiting complexity. Let $T(p) = T(p, \Lambda)$. All definitions for unconditional complexity, joint complexity etc. are automatically in force for $K = C_T$.
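A quick sketch of Definition 1.6.1 together with the Kraft-type inequality of (1.1.1), checked on a finite prefix-free set; the helper names are hypothetical:

```python
def is_prefix_free(codewords) -> bool:
    """No codeword may be a prefix of another (Definition 1.6.1).
    In sorted order, any prefix relation shows up between neighbors."""
    words = sorted(codewords)
    return all(not words[i + 1].startswith(words[i])
               for i in range(len(words) - 1))

def kraft_sum(codewords) -> float:
    """For a prefix-free set, the sum of 2^-l(p) is at most 1."""
    return sum(2.0 ** -len(p) for p in codewords)

D = ["00", "010", "011", "10", "110", "1110"]
print(is_prefix_free(D), kraft_sum(D))  # True 0.9375
```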
Let us remark that the s.d. interpreter defined in Theorem 1.6.1 has the following stronger property. For any other s.d. interpreter $G$ there is a string $g$ such that we have
$$G(x) = T(gx) \qquad (1.6.2)$$
for all $x$. We could say that the interpreter $T$ is universal.

Let us show that the functions $C$ and $K$ are asymptotically equal.

Theorem 1.6.2
$$C \stackrel{+}{<} K \stackrel{+}{<} J(C). \qquad (1.6.3)$$
Proof. Obviously $C \stackrel{+}{<} K$, since the optimal s.d. interpreter $T$ is also one of the p.r. interpreters over which $U$ is optimal. For the other inequality, we define a self-delimiting p.r. function $F(p, x)$. The machine computing $F$ tries to decompose $p$ into the form $p = u^o v$ such that $u$ is the number $l(v)$ in binary notation. If it succeeds, it outputs $U(v, x)$. We have
$$K(y \mid x) \stackrel{+}{<} C_F(y \mid x) \stackrel{+}{<} C(y \mid x) + 2 \log C(y \mid x).$$
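Both the prefix code $p \mapsto p^o$ of Example 1.6.2 and the interpreter $F$ of the proof above are concrete enough to execute. Below is a minimal Python sketch, under the assumption (our reading of the notation, not stated in this section) that $x^o$ doubles each bit of $x$ and appends the terminator 01; the function names are ours.

```python
def sd(x: str) -> str:
    """The prefix code x -> x^o: double each bit, append terminator 01
    (assumed convention).  No codeword is a prefix of another."""
    return "".join(b + b for b in x) + "01"

def wrap(v: str) -> str:
    """The interpreter F of Theorem 1.6.2: a plain description v is made
    self-delimiting as u^o v, where u is l(v) in binary."""
    return sd(bin(len(v))[2:]) + v

def unwrap(stream) -> str:
    """Self-delimiting decoder: reads u^o bit by bit, then exactly l(v)
    further bits; it never needs an endmarker."""
    u = []
    while True:
        a, b = next(stream), next(stream)
        if a + b == "01":
            break
        assert a == b
        u.append(a)
    n = int("".join(u), 2)
    return "".join(next(stream) for _ in range(n))

v = "110100"
p = wrap(v)            # l(p) = l(v) + 2 log l(v) + O(1), as in the proof
assert unwrap(iter(p + "0011")) == v   # trailing bits are never read
```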
Let us review some of the relations proved in Sections 1.3 and 1.4. Instead of the simple estimate $C(x) \stackrel{+}{<} n$ for a binary string $x$ of length $n$, only the obvious consequence of Theorem 1.6.2 holds, namely
$$K(x) \stackrel{+}{<} J(n).$$
We must similarly change Theorem 1.4.3. We will use the universal p.r. function 푇푝 (푥) = 푇 ( 푝, 푥). The r.e. set 푉푝 is the range of the function 푇푝. Let 퐸 = 푉푝 be an enumerable set of pairs of strings. The section 퐸푎 is defined as in (1.4.5). Then for all 푥 ∈ 퐸푎, we have
$$K(x \mid a) \stackrel{+}{<} J(\log |E_a|) + l(p). \qquad (1.6.4)$$
We do not need $J(l(p))$ since the function $T$ is self-delimiting. However, the real analogue of Theorem 1.4.3 is the Coding Theorem, to be proved later in this section.

Part b of Theorem 1.3.2 holds for $K$, but is not the strongest statement that can be made: the counterpart of that theorem is the inequality (1.6.10) below. Theorem 1.4.1 and its corollary hold without change for $K$. The additivity relations will be proved in a sharper form for $K$ in the next section.
1.6.2 Universal semimeasure

Self-delimiting complexity has an interesting characterization that is very useful in applications.

Definition 1.6.6 A function $w : S \to [0, 1]$ is called a semimeasure over the space $S$ if
$$\sum_x w(x) \le 1. \qquad (1.6.5)$$
It is called a measure (a probability distribution) if equality holds here. A semimeasure is called constructive if it is lower semicomputable. A constructive semimeasure which multiplicatively dominates every other constructive semimeasure is called universal. y

Remark 1.6.7 Just as in an earlier remark, we warn that semimeasures will later be defined over the space $\mathbb{N}^{\mathbb{N}}$. They will be characterized by a nonnegative function $w$ over $S = \mathbb{N}^*$, but with a condition different from (1.6.5). y

The following remark is useful.

Proposition 1.6.8 If a constructive semimeasure $w$ is also a measure then it is computable.
Proof. If we compute an approximation $w_t$ of the function $w$ from below for which $1 - \varepsilon < \sum_x w_t(x)$, then we have $|w(x) - w_t(x)| < \varepsilon$ for all $x$.

Theorem 1.6.3 There is a universal constructive semimeasure.
Proof. Every computable family $w_i$ ($i = 1, 2, \ldots$) of semimeasures is dominated by the semimeasure $\sum_i 2^{-i} w_i$. Since there are only countably many constructive semimeasures, there exists a semimeasure dominating them all. The only point to prove is that the dominating semimeasure can be made constructive. Below, we construct a function $\mu_p(x)$, lower semicomputable in $p, x$, such that $\mu_p$ as a function of $x$ is a constructive semimeasure for all $p$, and all constructive semimeasures occur among the $\mu_p$. Once we have $\mu_p$ we are done, because for any positive computable function $\delta(p)$ with $\sum_p \delta(p) \le 1$ the semimeasure $\sum_p \delta(p) \mu_p$ is obviously lower semicomputable and dominates all the $\mu_p$.

Let $S_p(x)$ be the lower semicomputable function with Gödel number $p$, and let $S_p^t(x)$ be a function recursive in $p, t, x$, with rational values, nondecreasing in $t$, such that $\lim_t S_p^t(x) = \max\{0, S_p(x)\}$ and for each $t, p$, the function $S_p^t(x)$ is different from 0 only for $x \le t$. Let us define $\mu_p^t$ recursively in $t$ as follows. Let $\mu_p^0(x) = 0$. Suppose that $\mu_p^t$ is already defined. If $S_p^{t+1}$ is a semimeasure then we
define $\mu_p^{t+1} = S_p^{t+1}$, otherwise $\mu_p^{t+1} = \mu_p^t$. Let $\mu_p = \lim_t \mu_p^t$. The function $\mu_p(x)$ is, by its definition, lower semicomputable in $p, x$ and a semimeasure for each fixed $p$. It is equal to $S_p$ whenever $S_p$ is a semimeasure, hence the family $\mu_p$ contains all constructive semimeasures.

The above theorem justifies the following notation.

Definition 1.6.9 We choose a fixed universal constructive semimeasure and call it $\mathbf{m}(x)$. y

With this notation, we can make it a little more precise in which sense this semimeasure dominates all other constructive semimeasures.

Definition 1.6.10 For an arbitrary constructive semimeasure $\nu$ let us define
$$\mathbf{m}(\nu) = \sum \{\, \mathbf{m}(p) : \text{the (self-delimiting) program } p \text{ computes } \nu \,\}. \qquad (1.6.6)$$
Similar notation will be used for other objects: for example, $\mathbf{m}(f)$ now makes sense for a recursive function $f$. y

Let us apply the above concept to some interesting cases.

Theorem 1.6.4 For all constructive semimeasures $\nu$ and for all strings $x$, we have
$$\mathbf{m}(\nu)\,\nu(x) \stackrel{*}{<} \mathbf{m}(x). \qquad (1.6.7)$$
Proof. Let us repeat the proof of Theorem 1.6.3, using $\mathbf{m}(p)$ in place of $\delta(p)$. We obtain a new constructive semimeasure $\mathbf{m}'(x)$ with the property $\mathbf{m}(\nu)\,\nu(x) \le \mathbf{m}'(x)$ for all $x$. Noting that $\mathbf{m}'(x) \stackrel{*}{=} \mathbf{m}(x)$ finishes the proof.
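The mixing step used in the last two proofs can be spelled out directly. Here is a minimal Python sketch (the toy component semimeasures and the weights are our own illustration): a weighted sum of semimeasures with weights summing to at most 1 is again a semimeasure, and it multiplicatively dominates each component, which is what (1.6.7) asserts for the particular weights $\mathbf{m}(p)$.

```python
def mixture(components, weights):
    """sum_p delta(p) mu_p: a semimeasure whenever sum(weights) <= 1,
    dominating each component mu_p with multiplicative constant delta(p)."""
    def mixed(x):
        return sum(d * mu(x) for d, mu in zip(weights, components))
    return mixed

# Two toy semimeasures on the natural numbers (illustrative only):
mu1 = lambda n: 2.0 ** -(n + 1)            # total mass 1
mu2 = lambda n: 1.0 / ((n + 1) * (n + 2))  # total mass 1 (telescoping sum)
m = mixture([mu1, mu2], [0.5, 0.25])       # weights sum to 0.75 <= 1
assert all(m(n) >= 0.5 * mu1(n) and m(n) >= 0.25 * mu2(n) for n in range(100))
```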
1.6.3 Prefix codes

The main theorem of this section says $K(x) \stackrel{+}{=} -\log \mathbf{m}(x)$. Before proving it, we relate descriptional complexity to the classical coding problem of information theory.

Definition 1.6.11 A (binary) code is any mapping from a set $E$ of binary sequences to a set of objects. This mapping is the decoding function of the code. If the mapping is one-to-one, then sometimes the set $E$ itself is also called a code. A code is a prefix code if $E$ is a prefix-free set. y

The interpreter $U(p)$ is a decoding function.

Definition 1.6.12 We call the string $p$ the first shortest description of $x$ if it is the lexicographically first binary word with $l(p) = C(x)$ and $U(p) = x$. y
The set of first shortest descriptions is a code. Since a code is one-to-one, this gives again the bound implicit in (1.5.1):
$$|\{\, x : C(x) = n \,\}| \le 2^n. \qquad (1.6.8)$$
The first shortest descriptions in the definition of $K(x)$ form, moreover, a prefix code. We recall a classical result of information theory.
Lemma 1.6.13 (Kraft's Inequality, see [13]) For any sequence $l_1, l_2, \ldots$ of natural numbers, there is a prefix code with exactly these numbers as codeword lengths if and only if
$$\sum_i 2^{-l_i} \le 1. \qquad (1.6.9)$$
Proof. Recall the standard correspondence $x \leftrightarrow [x]$ between binary strings and binary subintervals of the interval $[0, 1]$. A prefix code corresponds to a set of disjoint intervals, and the length of the interval $[x]$ is $2^{-l(x)}$. This proves that (1.6.9) holds for the lengths of codewords of a prefix code.

Suppose that $l_i$ is given and (1.6.9) holds. We can assume that the sequence $l_i$ is nondecreasing. Chop disjoint, adjacent intervals $I_1, I_2, \ldots$ of length $2^{-l_1}, 2^{-l_2}, \ldots$ from the left end of the interval $[0, 1]$. The right end of $I_k$ is $\sum_{j=1}^{k} 2^{-l_j}$. Since the sequence $l_j$ is nondecreasing, all intervals $I_j$ are binary intervals. Take the binary string corresponding to $I_j$ as the $j$-th codeword.
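The interval-chopping half of this proof is an algorithm. A minimal Python sketch of it (the function name and example lengths are ours; floating point suffices at these scales):

```python
def kraft_code(lengths):
    """A prefix code with the given codeword lengths (nondecreasing,
    sum of 2^-l at most 1), by chopping intervals off [0, 1)."""
    assert all(a <= b for a, b in zip(lengths, lengths[1:]))
    assert sum(2.0 ** -l for l in lengths) <= 1.0
    codewords, left = [], 0.0
    for l in lengths:
        # left is a multiple of 2^-l since the lengths are nondecreasing,
        # so [left, left + 2^-l) is a binary interval
        codewords.append(format(int(round(left * 2 ** l)), "0%db" % l))
        left += 2.0 ** -l
    return codewords

print(kraft_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```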
Corollary 1.6.14 We have $-\log \mathbf{m}(x) \stackrel{+}{<} K(x)$.

Proof. The lemma implies
$$\sum_y 2^{-K(y \mid x)} \le 1. \qquad (1.6.10)$$
Since $K(x)$ is an upper semicomputable function, the function $2^{-K(x)}$ is lower semicomputable. Hence it is a constructive semimeasure, and we have $-\log \mathbf{m}(x) \stackrel{+}{<} K(x)$.

The construction in the second part of the proof of Lemma 1.6.13 has some disadvantages. We do not always want to rearrange the numbers $l_j$, for example because we want the order of the codewords to reflect the order of our original objects. Without rearrangement, we can still achieve a code with only slightly longer codewords.
Lemma 1.6.15 (Shannon-Fano code, see [13]) Let $w_1, w_2, \ldots$ be positive numbers with $\sum_j w_j \le 1$. There is a binary prefix code $p_1, p_2, \ldots$, where the codewords are in lexicographical order, such that
$$|p_j| \le -\log w_j + 2. \qquad (1.6.11)$$
Proof. We follow the construction above and cut off disjoint, adjacent (not necessarily binary) intervals $I_j$ of length $w_j$ from the left end of $[0, 1]$. Let $v_j$ be the length of the longest binary intervals contained in $I_j$, and let $p_j$ be the binary word corresponding to the first one of these. Four or fewer intervals of length $v_j$ cover $I_j$, hence $w_j \le 4 v_j = 2^{-|p_j|+2}$, which gives (1.6.11).
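This construction is also directly executable. A Python sketch in the same style as the previous one (names and example weights are ours): unlike in the Kraft construction, the weights need not be sorted; for simplicity we take the fixed safe codeword length $\lceil -\log w_j \rceil + 1 \le -\log w_j + 2$ instead of hunting for the longest fitting binary interval.

```python
import math

def shannon_fano(weights):
    """Codewords with |p_j| <= -log2(w_j) + 2, in lexicographical order,
    assuming sum(weights) <= 1; the weights may come in any order."""
    codes, left = [], 0.0
    for w in weights:
        l = math.ceil(-math.log2(w)) + 1   # 2^-l <= w/2, so an interval fits
        j = math.ceil(left * 2 ** l)       # first multiple of 2^-l in I_j
        codes.append(format(j, "0%db" % l))
        left += w
    return codes

print(shannon_fano([0.3, 0.5, 0.2]))   # ['000', '10', '1101']
```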
1.6.4 The coding theorem for $K(x)$

The above results suggest $K(x) \stackrel{+}{<} -\log \mathbf{m}(x)$. But in the proof, we have to deal with the problem that $\mathbf{m}$ is not computable.

Theorem 1.6.5 (Coding Theorem, see [31, 17, 10])
$$K(x) \stackrel{+}{=} -\log \mathbf{m}(x). \qquad (1.6.12)$$
Proof. We construct a self-delimiting p.r. function $F(p)$ with the property that $C_F(x) \le -\log \mathbf{m}(x) + 4$. The function $F$ to be constructed is the decoding function of a prefix code, hence the code construction of Lemma 1.6.13 proposes itself. But since the function $\mathbf{m}(x)$ is only lower semicomputable, it is given only by a sequence converging to it from below. Let $\{(z_t, k_t) : t = 1, 2, \ldots\}$ be a recursive enumeration, without repetition, of the set $\{(x, k) : 2^{-k} < \mathbf{m}(x)\}$. Then
$$\sum_t 2^{-k_t} = \sum_x \sum_{z_t = x} 2^{-k_t} \le \sum_x 2\,\mathbf{m}(x) \le 2.$$
Let us cut off consecutive adjacent, disjoint intervals $I_t$ of length $2^{-k_t - 1}$ from the left end of the interval $[0, 1]$. We define $F$ as follows: if $[p]$ is a largest binary subinterval of some $I_t$, then $F(p) = z_t$; otherwise $F(p)$ is undefined. The function $F$ is obviously self-delimiting and partial recursive. It follows from the construction that for every $x$ there is a $t$ with $z_t = x$ and $0.5\,\mathbf{m}(x) < 2^{-k_t}$. Therefore, for every $x$ there is a $p$ such that $F(p) = x$ and $l(p) \le -\log \mathbf{m}(x) + 4$.
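The proof is effective, and it works online: codewords are assigned while the pairs $(z_t, k_t)$ are being enumerated, and one object may receive many codewords. A minimal Python sketch under our own simplification of taking the first binary interval of length $2^{-k_t-2}$ (half the length of $I_t$) instead of a largest binary subinterval; the toy enumeration is illustrative:

```python
import math
from fractions import Fraction   # exact interval endpoints

def coding_codewords(pairs):
    """Consume an enumeration (z_t, k_t) with sum_t 2^-(k_t+1) <= 1 and
    assign to z_t a codeword of length k_t + 2 lying inside the interval
    I_t of length 2^-(k_t+1) chopped from the left of [0, 1)."""
    left = Fraction(0)
    for z, k in pairs:
        l = k + 2                          # 2^-l is half the length of I_t
        j = math.ceil(left * 2 ** l)       # first multiple of 2^-l in I_t
        yield z, format(j, "0%db" % l)
        left += Fraction(1, 2 ** (k + 1))  # move past I_t

for z, p in coding_codewords([("a", 1), ("b", 2), ("a", 2)]):
    print(z, p)   # a 000 / b 0100 / a 0110: a prefix-free code, built online
```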
The Coding Theorem can be straightforwardly generalized as follows. Let $f(x, y)$ be a lower semicomputable nonnegative function. Then we have
$$K(y \mid x) \stackrel{+}{<} -\log f(x, y) \qquad (1.6.13)$$
for all $x$ with $\sum_y f(x, y) \le 1$. The proof is the same.
1.6.5 Algorithmic probability

We can interpret the Coding Theorem as follows. It is known in classical information theory that for any probability distribution $w(x)$ ($x \in S$) a binary prefix code $f_w(x)$ can be constructed such that $l(f_w(x)) \le -\log w(x) + 1$. We learned that for computable, moreover even for constructive, distributions $w$ there is a universal code with a self-delimiting partial recursive decoding function $T$ independent of $w$ such that for the codeword length $K(x)$ we have
$$K(x) \le -\log w(x) + c_w.$$
Here, only the additive constant $c_w$ depends on the distribution $w$.

Let us imagine the self-delimiting Turing machine $\mathcal{T}$ computing our interpreter $T(p)$. Every time the machine asks for a new bit of the description, we could toss a coin to decide whether to give 0 or 1.
Definition 1.6.16 Let $P_T(x)$ be the probability that the self-delimiting machine $\mathcal{T}$ gives out result $x$ after receiving random coin-toss bits as inputs. y

We can write $P_T(x) = \sum W_T(x)$, where $W_T(x)$ is the set $\{\, 2^{-l(p)} : T(p) = x \,\}$. Since $2^{-K(x)} = \max W_T(x)$, we have $-\log P_T(x) \le K(x)$. The semimeasure $P_T$ is constructive, hence $P_T \stackrel{*}{<} \mathbf{m}$, hence
$$K \stackrel{+}{<} -\log \mathbf{m} \stackrel{+}{<} -\log P_T \le K,$$
hence
$$-\log P_T \stackrel{+}{=} -\log \mathbf{m} \stackrel{+}{=} K.$$
Hence the sum $P_T(x)$ of the set $W_T(x)$ is at most a constant times larger than its maximal element $2^{-K(x)}$. This relation was not obvious in advance. The outcome $x$ might have high probability because it has many long descriptions. But we found that then it must have a short description too. In what follows it will be convenient to fix the definition of $\mathbf{m}(x)$ as follows.

Definition 1.6.17 From now on let us define
$$\mathbf{m}(x) = P_T(x). \qquad (1.6.14)$$
y
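The coin-tossing reading of $P_T$ can be simulated, at least for toy machines: feed the machine fresh random bits on demand and tally its outputs. The Python sketch below is entirely our illustration; the real optimal interpreter $T$ cannot be sampled faithfully this way, since halting is undecidable, which is why we impose an arbitrary budget on input requests.

```python
import random

def toy_machine(ask_bit):
    """A stand-in self-delimiting machine: it reads bits until the first 1
    and outputs the number of 0s read; its domain {1, 01, 001, ...} is
    prefix-free, and its output probability is P_T(n) = 2^-(n+1)."""
    n = 0
    while ask_bit() == 0:
        n += 1
    return n

def estimate_PT(machine, trials=100_000, max_requests=64):
    """Tally outputs of `machine` when its input bits are coin tosses."""
    counts = {}
    for _ in range(trials):
        budget = [max_requests]
        def ask_bit():
            budget[0] -= 1
            if budget[0] < 0:
                raise TimeoutError        # crude stand-in for non-halting
            return random.getrandbits(1)
        try:
            x = machine(ask_bit)
            counts[x] = counts.get(x, 0) + 1
        except TimeoutError:
            pass
    return {x: c / trials for x, c in sorted(counts.items())}

print(estimate_PT(toy_machine))  # close to {0: 0.5, 1: 0.25, 2: 0.125, ...}
```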
1.7 The statistics of description length
We can informally summarize the relation $K \stackrel{+}{=} -\log P_T$ by saying that if an object has many long descriptions then it also has a short one. But how many descriptions does an object really have? With the Coding Theorem, we can answer several questions of this kind. The reader can view this section as a series of exercises in the technique acquired up to now.

Theorem 1.7.1 Let $f(x, n)$ be the number of binary strings $p$ of length $n$ with $T(p) = x$, for the universal s.d. interpreter $T$ defined in Theorem 1.6.1. Then for every $n \ge K(x)$, we have
$$\log f(x, n) \stackrel{+}{=} n - K(x, n). \qquad (1.7.1)$$
Using $K(x) \stackrel{+}{<} K(x, n)$, which holds just as (1.4.1) does, and substituting $n = K(x)$, we obtain $\log f(x, K(x)) \stackrel{+}{=} K(x) - K(x, K(x)) \stackrel{+}{<} 0$, implying the following.

Corollary 1.7.1 The number of shortest descriptions of any object is bounded by a universal constant.

Since the number $\log f(x, K(x))$ is nonnegative, we have also derived the identity
$$K(x, K(x)) \stackrel{+}{=} K(x), \qquad (1.7.2)$$
which could also have been proved in the same way as Theorem 1.4.4.
Proof of Theorem 1.7.1. It is more convenient to prove equation (1.7.1) in the form
$$2^{-n} f(x, n) \stackrel{*}{=} \mathbf{m}(x, n), \qquad (1.7.3)$$
where $\mathbf{m}(x, y) = \mathbf{m}(\langle x, y \rangle)$.

First we prove $\stackrel{*}{<}$. Let us define a p.r. self-delimiting function $F$ by $F(p) = \langle T(p), l(p) \rangle$. By the coin-tossing argument, $2^{-n} f(x, n)$ is the probability that $F$, applied to random bits, outputs the pair $\langle x, n \rangle$. Therefore the left-hand side of (1.7.3), as a function of $\langle x, n \rangle$, is a constructive semimeasure, dominated by $\mathbf{m}(x, n)$.

Now we prove $\stackrel{*}{>}$. We define a self-delimiting p.r. function $G$ as follows. The machine computing $G(p)$ tries to decompose $p$ into three segments $p = \beta(c)^o v w$ in such a way that $T(v)$ is a pair $\langle x, l(p) + c \rangle$. If it succeeds then it outputs $x$. By the universality of $T$, there is a binary string $g$ such that $T(gp) = G(p)$ for all $p$. Let $r = l(g)$. For an arbitrary pair $x, n$, let $q$ be a shortest binary string with $T(q) = \langle x, n \rangle$, and $w$ an arbitrary string of length
$$l = n - l(q) - r - l(\beta(r)^o).$$
Then $G(\beta(r)^o q w) = T(g \beta(r)^o q w) = x$. Since $w$ is arbitrary here, there are at least $2^l$ binary strings $p$ of length $n$ with $T(p) = x$, and $l \stackrel{+}{=} n - K(x, n)$.

How many objects are there with a given complexity $n$? We can answer this question with a good approximation.
Definition 1.7.2 Let $g_T(n)$ be the number of objects $x \in S$ with $K(x) = n$, and $D_n$ the set of binary strings $p$ of length $n$ for which $T(p)$ is defined. Let us define the moving average
$$h_T(n, c) = \frac{1}{2c+1} \sum_{i=-c}^{c} g_T(n+i).$$
y

Here is an estimate of these numbers.

Theorem 1.7.2 ([47]) There is a natural number $c$ such that
$$\log |D_n| \stackrel{+}{=} n - K(n), \qquad (1.7.4)$$
$$\log h_T(n, c) \stackrel{+}{=} n - K(n). \qquad (1.7.5)$$
Since we are in general interested only in accuracy up to additive constants, we could omit the normalizing factor $1/(2c+1)$ from the moving average $h_T$. We do not know whether this average can be replaced by $g_T(n)$. This might depend on the particular universal partial recursive function we use to construct the optimal s.d. interpreter $T(p)$. But equation (1.7.5) is true for any optimal s.d. interpreter $T$ (such that the inequality (1.6.1) holds for all $F$), while there is an optimal s.d. interpreter $F$ for which $g_F(n) = 0$ for every odd $n$. Indeed, let $F(00p) = T(p)$ if $l(p)$ is even, $F(1p) = T(p)$ if $l(p)$ is odd, and let $F$ be undefined in all other cases. Then $F$ is defined only for inputs of even length, while $C_F \le C_T + 2$.

Lemma 1.7.3
$$\sum_y \mathbf{m}(x, y) \stackrel{*}{=} \mathbf{m}(x).$$
Proof. The left-hand side is a constructive semimeasure, therefore it is dominated by the right-hand side. To show $\stackrel{*}{>}$, note that by equation (1.4.2) as applied to $K$, we have $K(x, x) \stackrel{+}{=} K(x)$. Therefore $\mathbf{m}(x) \stackrel{*}{=} \mathbf{m}(x, x) \stackrel{*}{<} \sum_y \mathbf{m}(x, y)$.

Proof of Theorem 1.7.2. Let $d_n = |D_n|$. Using Lemma 1.7.3 and Theorem 1.7.1 we have
$$g_T(n) \le d_n = \sum_x f(x, n) \stackrel{*}{=} 2^n \sum_x \mathbf{m}(x, n) \stackrel{*}{=} 2^n \mathbf{m}(n).$$
Since the complexity $K$ has the same continuity property (1.4.3) as $C$, we can derive from here $h_T(n, c) \stackrel{*}{<} 2^n \mathbf{m}(n)\, O(2^c c^2)$. To prove $h_T(n, c) \stackrel{*}{>} d_n$ for an appropriate $c$, we will prove that the complexity of at least half of the elements of $D_n$ is near $n$.

For some constant $c_0$, we have $K(p) \le n + c_0$ for all $p \in D_n$. Indeed, if the s.d. p.r. interpreter $F(p)$ is equal to $p$ where $T(p)$ is defined, and is undefined otherwise, then $C_F(p) \le n$ for $p \in D_n$. By $K(p, l(p)) \stackrel{+}{=} K(p)$ and a variant of Lemma 1.7.3, we have
$$\sum_{p \in D_n} \mathbf{m}(p) \stackrel{*}{=} \sum_{p \in D_n} \mathbf{m}(p, n) \stackrel{*}{=} \mathbf{m}(n).$$
Using the expression (1.7.4) for 푑푛, we see that there is a constant 푐1 such that
$$d_n^{-1} \sum_{p \in D_n} 2^{-K(p)} \le 2^{-n + c_1}.$$
Using Markov's Inequality, the number of elements $p$ of $D_n$ with $K(p) \ge n - c_1 - 1$ is at least $d_n/2$. We finish the proof by taking $c$ to be the maximum of $c_0$ and $c_1 + 1$.

What can be said about the complexity of a binary string of length $n$? We mentioned earlier the estimate $K(x) \stackrel{+}{<} n + 2 \log n$. The same argument gives the stronger estimate
$$K(x) \stackrel{+}{<} l(x) + K(l(x)). \qquad (1.7.6)$$
By a calculation similar to the proof of Theorem 1.7.2, we can show that the estimate (1.7.6) is exact for most binary strings. First, it follows just as in Lemma 1.7.3 that there is a constant $c$ such that
$$\sum_{l(p)=n} 2^{-n}\, 2^{-K(p)} \le c\, 2^{-n} \mathbf{m}(n).$$
From this, Markov's Inequality gives that the number of strings $p \in \mathbb{B}^n$ with $K(p) < n - k$ is at most $c\, 2^{n-k}$. There is a more polished way to express this result.

Definition 1.7.4 Let us introduce the following function of natural numbers:
$$K^+(n) = \max_{k \le n} K(k).$$
y
This is the smallest monotonic function above $K(n)$. It is upper semicomputable just as $K$ is, hence $2^{-K^+}$ is universal among the monotonically decreasing constructive semimeasures. The following theorem expresses $K^+$ more directly in terms of $K$.

Theorem 1.7.3 We have $K^+(n) \stackrel{+}{=} \log n + K(\lfloor \log n \rfloor)$.

Proof. To prove $\stackrel{+}{<}$, we construct an s.d. interpreter as follows. The machine computing $F(p)$ finds a decomposition $uv$ of $p$ such that $T(u) = l(v)$, then outputs the number whose binary representation (with possible leading zeros) is the binary string $v$. With this $F$, we have
$$K(k) \stackrel{+}{<} C_F(k) \stackrel{+}{<} \log n + K(\lfloor \log n \rfloor) \qquad (1.7.7)$$
for all $k \le n$. The numbers between $n/2$ and $n$ are exactly the ones whose binary representation has length $\lfloor \log n \rfloor + 1$. Therefore if the bound (1.7.6) is sharp for most $x$ then the bound (1.7.7) is sharp for most $k$.

The sharp monotonic estimate $\log n + K(\lfloor \log n \rfloor)$ for $K(n)$ is less satisfactory than the sharp estimate $\log n$ for Kolmogorov's complexity $C(n)$, because it is not a computable function. We can derive several computable upper bounds from it, for example $\log n + 2\log\log n$, $\log n + \log\log n + 2\log\log\log n$, etc., but none of these is sharp for large $n$. We can still hope to find computable upper bounds of $K$ which are sharp infinitely often. The next theorem provides one.

Theorem 1.7.4 (see [47]) There is a computable upper bound $G(n)$ of the function $K(n)$ with the property that $G(n) = K(n)$ holds for infinitely many $n$.

For the proof, we make use of an s.d. Turing machine $\mathcal{T}$ computing the function $T$. Similarly to the proof of Theorem 1.5.1, we introduce a time-restricted complexity.

Definition 1.7.5 We know that for some constant $c$, the function $K(n)$ is bounded by $2\log n + c$. Let us fix such a $c$. For any number $n$ and s.d. Turing machine $\mathcal{M}$, let $K_{\mathcal{M}}(n; t)$ be the minimum of $2\log n + c$ and the lengths of descriptions from which $\mathcal{M}$ computes $n$ in $t$ or fewer steps. y

At the time we proved the Invariance Theorem we were not interested in exactly how the universal p.r. function $V$ was to be computed. Now we need to be more specific.

Definition 1.7.6 Let us agree now that the universal p.r. function is computed by a universal Turing machine $\mathcal{V}$ with the property that for any Turing machine
$\mathcal{M}$ there is a constant $m$ such that simulating $t$ steps of $\mathcal{M}$ takes no more than $mt$ steps on $\mathcal{V}$. Let $\mathcal{T}$ be the Turing machine constructed from $\mathcal{V}$ computing the optimal s.d. interpreter $T$, and let us write
$$K(n; t) = K_{\mathcal{T}}(n; t).$$
y

With these definitions, for each s.d. Turing machine $\mathcal{M}$ there are constants $m_0, m_1$ with
$$K(n; m_0 t) \le K_{\mathcal{M}}(n; t) + m_1. \qquad (1.7.8)$$
The function $K(n; t)$ is nonincreasing in $t$.
Proof of Theorem 1.7.4. First we prove that there are constants $c_0, c_1$ such that $K(n; t) < K(n; t-1)$ implies
$$K(\langle n, t \rangle; c_0 t) \le K(n; t) + c_1. \qquad (1.7.9)$$
Indeed, we can construct a Turing machine $\mathcal{M}$ simulating the work of $\mathcal{T}$ such that if $\mathcal{T}$ outputs a number $n$ in $t$ steps then $\mathcal{M}$ outputs the pair $\langle n, t \rangle$ in $2t$ steps:
$$K_{\mathcal{M}}(\langle n, t \rangle; 2t) \le K(n; t).$$
Combining this with the inequality (1.7.8) gives the inequality (1.7.9). We now define a recursive function $F$ as follows. To compute $F(x)$, the s.d. Turing machine first tries to find $n, t$ such that $x = \langle n, t \rangle$. (It stops even if it does not find them.) Then it outputs $K(x; c_0 t)$. Since $F(x)$ is the length of some description of $x$, it is an upper bound for $K(x)$. On the other hand, suppose that $x$ is one of the infinitely many integers of the form $\langle n, t \rangle$ with
$$K(n) = K(n; t) < K(n; t-1).$$
Then the inequality (1.7.9) implies $F(x) \le K(n) + c_1$, while $K(n) \stackrel{+}{<} K(x)$ is known (it is the analogue of (1.4.4) for $K$), so $F(x) \le K(x) + c$ for some constant $c$. Now if $F$ were not equal to $K$ at infinitely many places, but only came closer to it than $c$, then we could decrease it by some additive constant less than $c$ and modify it at finitely many places to obtain the desired function $G$.
2 Randomness
2.1 Uniform distribution
Speaking of a "random binary string", one generally understands randomness with respect to the coin-toss distribution (when the probability $P(x)$ of a binary sequence $x$ of length $n$ is $2^{-n}$). This is the only distribution considered in the present section.

One can hope to distinguish sharply between random and nonrandom infinite sequences. Indeed, an infinite binary sequence whose elements are 0 with only finitely many exceptions can be considered nonrandom without qualification. This section defines randomness for finite strings. For a sequence $x$ of length $n$, it would be unnatural to fix some number $k$, declare $x$ nonrandom when its first $k$ elements are 0, but permit it to be random if only the first $k-1$ elements are 0. So, we will just declare some finite strings less random than others. For this, we introduce a certain real-valued function $d(x) \ge 0$ measuring the deficiency of randomness in the string $x$.

The occurrence of a nonrandom sequence is considered an exceptional event, therefore the function $d(x)$ can assume large values only with small probability. What is "large" and "small" depends only on our "unit of measurement". We require, for all $n, k$:
$$\sum \{\, P(x) : x \in \mathbb{B}^n,\ d(x) > k \,\} < 2^{-k}, \qquad (2.1.1)$$
saying that there are fewer than $2^{n-k}$ binary sequences $x$ of length $n$ with $d(x) > k$. Under this condition, we even allow $d(x)$ to take the value $\infty$.

To avoid arbitrariness in the distinction between random and nonrandom, the function $d(x)$ must be simple. We assume therefore that the set $\{(n, k, x) : l(x) = n,\ d(x) > k\}$ is recursively enumerable, or, which is the same, that the function
d : Ω → (−∞, ∞]
is lower semicomputable.

Remark 2.1.1 We do not assume that the set of strings $x \in \mathbb{B}^n$ with small deficiency of randomness is enumerable, because then it would be easy to construct a "random" sequence. y
Definition 2.1.2 A lower semicomputable function $d : S_2 \to \mathbb{R}$ (where $\mathbb{R}$ is the set of real numbers) is called a Martin-Löf test (ML-test), or a probability-bounded test, if it satisfies (2.1.1). An ML-test $d_0(x)$ is universal if it additively dominates all other ML-tests: for any other ML-test $d(x)$ there is a constant $c < \infty$ such that for all $x$ we have $d(x) < d_0(x) + c$. y

If a test $d_0$ is universal then any other test $d$ of randomness can discover in any sequence $x$ at most a constant amount more randomness deficiency than $d_0(x)$ does. Obviously, the difference of any two universal ML-tests is bounded by some constant. The following theorem reveals a simple connection between descriptional complexity and a certain randomness property.

Definition 2.1.3 Let us define the following function for binary strings $x$:
$$d_0(x) = l(x) - C(x \mid l(x)). \qquad (2.1.2)$$
y
Theorem 2.1.1 (Martin-Löf) The function 푑0 (푥) is a universal Martin-Löf test.
Proof. Since $C(x \mid y)$ is semicomputable from above (a straightforward consequence of Theorem 1.5.1), the function $d_0$ is semicomputable from below. The inequality (2.1.1) also holds: fewer than $2^{n-k}$ strings $x$ of length $n$ can satisfy $d_0(x) > k$, that is $C(x \mid l(x)) < n - k$, since there are fewer than $2^{n-k}$ descriptions of length less than $n - k$. Therefore $d_0$ is an ML-test.

We must show that it is larger (to within an additive constant) than any other ML-test. Let $d$ be an ML-test. Let us define the function $F(x, y)$ to be $y - d(x)$ for $y = l(x)$, and $\infty$ otherwise. Then $F$ is upper semicomputable, and satisfies (1.5.1). Therefore by Theorem 1.5.3, we have $C(x \mid l(x)) \stackrel{+}{<} l(x) - d(x)$.

Theorem 2.1.1 says that under very general assumptions about randomness, those strings $x$ are random whose descriptional complexity is close to its maximum, $l(x)$. The more "regularities" are discoverable in a sequence, the less random it is. However, this is true only of regularities which decrease the descriptional complexity. "Laws of randomness" are regularities whose probability is high, for example the law of large numbers, the law of the iterated logarithm, the arcsine law, etc. A random sequence will satisfy all such laws.
Example 2.1.4 The law of large numbers says that in most binary sequences of length $n$, the number of 0's is close to the number of 1's. We prove this law of probability theory by constructing an ML-test $d(x)$ taking large values on sequences in which the number of 0's is far from the number of 1's. Instead of requiring (2.1.1) we require a somewhat stronger property of $d(x)$: for all $n$,
$$\sum_{x \in \mathbb{B}^n} P(x)\, 2^{d(x)} \le 1. \qquad (2.1.3)$$
From this, the inequality (2.1.1) follows by the well-known Markov Inequality, which says the following. Let $P$ be any probability distribution, $f$ any nonnegative function, with expected value $E_P(f) = \sum_x P(x) f(x)$. For all $\lambda > 0$ we have
$$\sum \{\, P(x) : f(x) > \lambda E_P(f) \,\} \le 1/\lambda. \qquad (2.1.4)$$
For any string $x \in S$ and natural number $i$, let $N(i \mid x)$ denote the number of occurrences of $i$ in $x$. For a binary string $x$ of length $n$, define $p_x = N(1 \mid x)/n$, and
$$P_x(y) = p_x^{N(1 \mid y)} (1 - p_x)^{N(0 \mid y)},$$
$$d(x) = \log P_x(x) + n - \log(n+1).$$
We show that 푑(푥) is a ML-test. It is obviously computable. We prove (2.1.3). We have
$$\sum_x P(x)\, 2^{d(x)} = \sum_x 2^{-n}\, 2^n \frac{1}{n+1} P_x(x) = \frac{1}{n+1} \sum_x P_x(x) = \frac{1}{n+1} \sum_{k=0}^{n} \binom{n}{k} \Big(\frac{k}{n}\Big)^{k} \Big(1 - \frac{k}{n}\Big)^{n-k} \le \frac{1}{n+1} \sum_{k=0}^{n} 1 = 1.$$
The test $d(x)$ expresses the (weak) law of large numbers in a rather strong version. We rewrite $d(x)$ as
$$d(x) = n(1 - h(p_x)) - \log(n+1), \quad \text{where } h(p) = -p \log p - (1-p) \log(1-p).$$
The entropy function $h(p)$ achieves its maximum 1 at $p = 1/2$. Therefore the test $d(x)$ tells us that the probability of sequences $x$ with $p_x < p < 1/2$, for some constant $p$, is bounded by the exponentially decreasing quantity $(n+1)\, 2^{-n(1-h(p))}$. We also see that since for some constant $c$ we have
$$1 - h(p) > c\,(p - 1/2)^2,$$
if the difference $|p_x - 1/2|$ is much larger than
$$\sqrt{\frac{\log n}{cn}}$$
then the sequence $x$ is "not random", and hence, by Theorem 2.1.1, its complexity is significantly smaller than $n$. y
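Unlike the universal test, the test $d(x)$ of this example is computable, so we can evaluate it directly. A small Python sketch (the function name is ours):

```python
import math

def lln_test(x: str) -> float:
    """The test d(x) = n(1 - h(p_x)) - log(n+1) of Example 2.1.4;
    large positive values witness a violation of the law of large numbers."""
    n = len(x)
    p = x.count("1") / n
    h = 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return n * (1 - h) - math.log2(n + 1)

print(lln_test("01101001" * 16))  # balanced string: negative, no deficiency found
print(lln_test("1" * 128))        # constant string: about 128 - log2(129)
```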
2.2 Computable distributions
Let us generalize the notion of randomness to arbitrary discrete computable probability distributions.
2.2.1 Two kinds of test

Let $P(x)$ be a probability distribution over some discrete countable space $\Omega$, which we will identify for simplicity with the set $S$ of finite strings of natural numbers. Thus, $P$ is a nonnegative function with
$$\sum_x P(x) = 1.$$
Later we will consider probability distributions $P$ over the space $\mathbb{N}^{\mathbb{N}}$, and they will also be characterized by a nonnegative function $P(x)$ over $S$. However, the above condition will then be replaced by a different one. We assume $P$ to be computable, with some Gödel number $e$.

We want to define a test $d(x)$ of randomness with respect to $P$. It should measure how justified the assumption is that $x$ is the outcome of an experiment with distribution $P$.

Definition 2.2.1 A function $d : S \to \mathbb{R}$ is an integrable test, or expectation-bounded test, of randomness with respect to $P$ if it is lower semicomputable and satisfies the condition
$$\sum_x P(x)\, 2^{d(x)} \le 1. \qquad (2.2.1)$$
It is universal if it dominates all other integrable tests to within an additive constant. y
(A similar terminology is used in [36] to distinguish tests satisfying condition (2.2.1) from Martin-Löf tests.)

Remark 2.2.2 We must allow $d(x)$ to take negative values, even if only, say, values $\ge -1$, since otherwise condition (2.2.1) could only be satisfied with $d(x) = 0$ for all $x$ with $P(x) > 0$. y

Proposition 2.2.3 Let $c = \log(2\pi^2/6)$. If a function $d(x)$ satisfies (2.2.1) then it satisfies (2.1.1). If $d(x)$ satisfies (2.1.1) then $d - 2\log d - c$ satisfies (2.2.1).

Proof. The first statement follows by Markov's Inequality (2.1.4). The second one can be checked by immediate computation.

Thus, condition (2.2.1) is nearly equivalent to condition (2.1.1): for each test $d$ according to one of them, there is a test $d'$ asymptotically equal to $d$ satisfying the other. But condition (2.2.1) has the advantage of being a single inequality instead of infinitely many.
2.2.2 Randomness via complexity

We want to find a universal integrable test. Let us introduce the function $w(x) = P(x)\, 2^{d(x)}$. The above conditions on $d(x)$ imply that $w(x)$ is semicomputable from below and satisfies $\sum_x w(x) \le 1$: so it is a constructive semimeasure. With the universal constructive semimeasure of Theorem 1.6.3, we also obtain a universal test for any computable probability distribution $P$.

Definition 2.2.4 For an arbitrary measure $P$ over a discrete space $\Omega$, let us define
$$\mathsf{d}_P(x) = \log \frac{\mathbf{m}(x)}{P(x)} = -\log P(x) - K(x). \qquad (2.2.2)$$
y

The following theorem shows that randomness can be tested by checking how close the complexity $K(x)$ is to its upper bound $-\log P(x)$.
Theorem 2.2.1 The function d푃 (푥) = − log 푃(푥) − 퐾(푥) is a universal integrable test for any fixed computable probability distribution 푃. More exactly, it is lower semicomputable, satisfies (2.2.1) and for all integrable tests 푑(푥) for 푃, we have
$$d(x) \stackrel{+}{<} \mathsf{d}_P(x) + K(d) + K(P).$$
Proof. Let $d(x)$ be an integrable test for $P$. Then $\nu(x) = 2^{d(x)} P(x)$ is a constructive semimeasure, and it has a self-delimiting program of length $\stackrel{+}{<} K(d) + K(P)$.
It follows (using the definition (1.6.6)) that $\mathbf{m}(\nu) \stackrel{*}{>} 2^{-K(d)-K(P)}$; hence inequality (1.6.7) gives
$$\nu(x) \stackrel{*}{<} 2^{K(d)+K(P)}\, \mathbf{m}(x).$$
Taking logarithms finishes the proof.

The simple form of our universal test suggests remarkable interpretations. It says that the outcome $x$ is random with respect to the distribution $P$ if the latter assigns to $x$ a large enough probability; however, not in absolute terms, only relative to the universal semimeasure $\mathbf{m}(x)$. The relativization is essential, since otherwise we could not distinguish between random and nonrandom outcomes for the uniform distribution $P_n$ defined in Section 2.1.

For any (not necessarily computable) distribution $P$, the $P$-expected value of the function $\mathbf{m}(x)/P(x)$ is at most 1, therefore the relation
$$\mathbf{m}(x) \le k P(x)$$
holds with probability not smaller than $1 - 1/k$. If $P$ is computable then the inequality
$$\mathbf{m}(P)\, P(x) \stackrel{*}{<} \mathbf{m}(x)$$
holds for all $x$. Therefore the following applications are obtained.

• If we assume $x$ to be the outcome of an experiment with some simple computable probability distribution $P$, then $\mathbf{m}(x)$ is a good estimate of $P(x)$. The goodness depends on how simple $P$ is to define and how random $x$ is with respect to $P$: how justified the assumption is. Of course, we cannot compute $\mathbf{m}(x)$, but it is nevertheless well defined.

• If we trust that a given object $x$ is random with respect to a given distribution $P$, then we can use $P(x)$ as an estimate of $\mathbf{m}(x)$. The degree of approximation depends on the same two factors.

Let us illustrate the behavior of the universal semimeasure $\mathbf{m}(x)$ when $x$ runs over the set of natural numbers. We define the semimeasures $v$ and $w$ as follows. Let $v(n) = c/n^2$ for an appropriate constant $c$, and let $w(n) = v(k)$ if $n = 2^{2^k}$, and 0 otherwise. Then $\mathbf{m}(n)$ dominates both $v(n)$ and $w(n)$. The function $\mathbf{m}(n)$ dominates $1/(n \log^2 n)$, but at many places it jumps much higher than that. Indeed, it is easy to see that $\mathbf{m}(n)$ converges to 0 more slowly than any positive computable function converging to 0: in particular, it is not computable. Hence it cannot be a measure, that is, $\sum_x \mathbf{m}(x) < 1$. We feel that we can make $\mathbf{m}(x)$ "large" whenever $x$ is "simple".
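None of $\mathbf{m}$, $K$, $\mathsf{d}_P$ is computable, but any computable upper bound on $K$ yields a computable lower bound on the deficiency $\mathsf{d}_P(x) = -\log P(x) - K(x)$. The following Python sketch is purely our illustration, not a construction from these notes: it uses zlib's compressed length as a crude stand-in for description length, valid as an upper bound on $K$ only up to the unknown constant of translating a zlib archive into a self-delimiting program.

```python
import zlib, random

def deficiency_lower_bound(x: bytes, neg_log_P: float) -> float:
    """-log P(x) minus a computable upper bound on K(x); this lower-bounds
    d_P(x) up to an unknown additive constant, so it is only heuristic."""
    k_upper = 8 * len(zlib.compress(x, 9))   # compressed length in bits
    return neg_log_P - k_upper

n = 10_000                                   # uniform P on byte strings: -log P = 8n
coin = bytes(random.getrandbits(8) for _ in range(n))
periodic = b"ab" * (n // 2)
print(deficiency_lower_bound(coin, 8 * n))      # slightly negative: no evidence
print(deficiency_lower_bound(periodic, 8 * n))  # large positive: nonrandom
```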
We cannot compare the universal test defined in the present section directly with the one defined in Section 2.1. There, we investigated tests for a whole family $P_n$ of distributions, where $P_n$ is the uniform distribution on the set $\mathbb{B}^n$. That the domain of these distributions is a different one for each $n$ is not essential, since we can identify the set $\mathbb{B}^n$ with the set $\{2^n, \ldots, 2^{n+1} - 1\}$. Let us rewrite the function $d_0(x)$ defined in (2.1.2) as
$$d_0(x) = n - C(x \mid n) \quad \text{for } x \in \mathbb{B}^n,$$
and set it to $\infty$ for $x$ outside $\mathbb{B}^n$. Now the similarity to the expression
$$\mathsf{d}_P(x) = -\log P(x) - K(x)$$
is striking, because $n = -\log P_n(x)$. Together with the remark made in the previous paragraph, this observation suggests that the value of $-\log \mathbf{m}(x)$ is close to the complexity $C(x)$.

2.2.3 Conservation of randomness

The idea of conservation of randomness was first expressed by Levin, who published a series of papers culminating in [33], developing a general theory. In this section, we address the issue in its simplest form only. Suppose that we want to talk about the randomness of integers. One and the same integer $x$ can be represented in decimal, binary, or some other notation: each time, a different string will represent the same number. Still, the form of the representation should not affect the question of the randomness of $x$ significantly. How can this be expressed formally?

Let $P$ be a computable measure and let $f : S \to S$ be a computable function. We expect that under certain conditions, the randomness of $x$ should imply the randomness of $f(x)$; but, for which distribution? For example, if $x$ is a binary string and $f(x)$ is the decimal notation for the number expressed by $1x$, then we expect $f(x)$ to be random not with respect to $P$, but with respect to the measure $f^*P$ that is the image of $P$ under $f$. This measure is defined by the formula
$$(f^*P)(y) = P(f^{-1}(y)) = \sum_{f(x) = y} P(x). \qquad (2.2.3)$$
Proposition 2.2.5 If 푓 is a computable function and 푃 is a computable measure then 푓 ∗푃 is a computable measure.
Proof. It is sufficient to show that $f^*P$ is lower semicomputable, since we have seen that a lower semicomputable measure is computable. But the lower semicomputability can be seen immediately from the definition.

Let us see that randomness is indeed preserved for the image. Similarly to (1.6.6), let us define for an arbitrary computable function $f$:
$$\mathbf{m}(f) = \sum \{\, \mathbf{m}(p) : \text{program } p \text{ computes } f \,\}. \qquad (2.2.4)$$
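For distributions with finite support, the image measure of (2.2.3) is a one-liner, which makes the conservation statement below easy to experiment with. A minimal Python sketch (the toy distribution is ours):

```python
from collections import defaultdict

def pushforward(P, f):
    """Image measure (f*P)(y) = sum of P(x) over x with f(x) = y,
    following equation (2.2.3)."""
    fP = defaultdict(float)
    for x, px in P.items():
        fP[f(x)] += px
    return dict(fP)

# Toy example: uniform P on 3-bit strings, f erases the first bit.
P = {format(i, "03b"): 1 / 8 for i in range(8)}
print(pushforward(P, lambda x: x[1:]))  # uniform on 2-bit strings, 1/4 each
```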
Theorem 2.2.2 Let 푓 be a computable function. Between the randomness test for 푃 and the test for the image of 푃, the following relation holds, for all 푥:
$$\mathsf{d}_{f^*P}(f(x)) \stackrel{+}{<} \mathsf{d}_P(x) + K(f) + K(P). \qquad (2.2.5)$$
Proof. Let us denote the function on the left-hand side of (2.2.5) by
$$d_P(x) = \mathsf{d}_{f^*P}(f(x)) = \log \frac{\mathbf{m}(f(x))}{(f^*P)(f(x))}.$$
It is lower semicomputable, with the help of a program of length $\stackrel{+}{<} K(f) + K(P)$, by its very definition. Let us check that it is an integrable test, that is, that it satisfies the inequality (2.2.1). We have
$$\sum_x P(x)\, 2^{d_P(x)} = \sum_y (f^*P)(y)\, 2^{\mathsf{d}_{f^*P}(y)} = \sum_y \mathbf{m}(y) \le 1.$$
Hence $d_P(x)$ is a test, and hence it is $\stackrel{+}{<} \mathsf{d}_P(x) + K(f) + K(P)$, by the universality of the test $\mathsf{d}_P(x)$. (Strictly speaking, we must modify the proof of Theorem 2.2.1 and see that a program of length $K(f) + K(P)$ is also sufficient to define the semimeasure $P(x)\, 2^{d_P(x)}$.)

The appearance of the term $K(f)$ on the right-hand side is understandable: using some very complex function $f$, we can certainly turn any string into any other one, also a random string into a much less random one. On the other hand, the term $K(P)$ appears only due to the imperfection of our concepts. It will disappear in the more advanced theory presented in later sections, where we develop uniform tests.

Example 2.2.6 For each string $x$ of length $n \ge 1$, let
$$P(x) = \frac{2^{-n}}{n(n+1)}.$$
This distribution is uniform on strings of equal length. Let $f(x)$ be the function that erases the first $k$ bits. Then for $n \ge 1$ and for any string $x$ of length $n$ we have
$$(f^*P)(x) = \frac{2^{-n}}{(n+k)(n+k+1)}.$$
(The empty string carries the rest of the weight of $f^*P$.) The new distribution is still uniform on strings of equal length, though the total probability of strings of length $n$ has changed somewhat. We certainly expect randomness conservation here: after erasing the first $k$ bits of a random string of length $n + k$, we should still get a random string of length $n$. y

The computable function $f$ applied to a string $x$ can be viewed as some kind of transformation $x \mapsto f(x)$. As we have seen, it does not make $x$ less random. Suppose now that we introduce randomization into the transformation process itself: for example, the machine computing $f(x)$ can also toss some coins. We want to say that randomness is conserved under such more general transformations as well, but first of all, how do we express such a transformation mathematically? We will describe it by a "matrix": a computable probability transition function $T(x, y) \ge 0$ with the property that
$$\sum_y T(x, y) = 1.$$
Now, the image of the distribution $P$ under this transformation can be written as $T^*P$, and defined as follows:
$$(T^*P)(y) = \sum_x P(x)\, T(x, y).$$
How do we express randomness conservation now? It is certainly not true that every possible outcome $y$ is as random with respect to $T^*P$ as $x$ is with respect to $P$. We can only expect that for each $x$, the conditional probability (in terms of the transition $T(x, y)$) of those $y$ whose non-randomness is larger than that of $x$ is small. This will indeed be expressed by the corollary below. To get there, we upper-bound the $T(x, \cdot)$-expected value of $2^{\mathsf{d}_{T^*P}(y)}$. Let us go over to exponential notation:

Definition 2.2.7 Denote
$$\mathsf{t}_P(x) = 2^{\mathsf{d}_P(x)} = \frac{\mathbf{m}(x)}{P(x)}. \qquad (2.2.6)$$
y
Theorem 2.2.3 We have
$$\log \sum_y T(x, y)\, 2^{\mathsf{d}_{T^*P}(y)} \stackrel{+}{<} \mathsf{d}_P(x) + K(T) + K(P). \qquad (2.2.7)$$
Proof. Let us denote the exponential of the left-hand side of (2.2.7) by $t_P(x)$. It is lower semicomputable by its very construction, using a program of length $\stackrel{+}{<} K(T) + K(P)$. Let us check that it satisfies $\sum_x P(x)\, t_P(x) \le 1$, which in this notation corresponds to inequality (2.2.1). We have
$$\sum_x P(x)\, t_P(x) = \sum_x P(x) \sum_y T(x, y)\, \mathsf{t}_{T^*P}(y) = \sum_y (T^*P)(y)\, \frac{\mathbf{m}(y)}{(T^*P)(y)} = \sum_y \mathbf{m}(y) \le 1.$$
It follows that $d_P(x) = \log t_P(x)$ is an integrable test, and hence $d_P(x) \stackrel{+}{<} \mathsf{d}_P(x) + K(T) + K(P)$. (See the remark at the end of the proof of Theorem 2.2.2.)
Corollary 2.2.8 There is a constant 푐0 such that for every integer 푘 ≥ 0, for all 푥 we have
$$\sum \{\, T(x, y) : \mathsf{d}_{T^*P}(y) - \mathsf{d}_P(x) > k + K(T) + K(P) \,\} \le 2^{-k + c_0}.$$
Proof. The theorem says
$$\sum_y T(x, y)\, \mathsf{t}_{T^*P}(y) \stackrel{*}{<} \mathsf{t}_P(x)\, 2^{K(T) + K(P)}.$$
Thus, it upper-bounds the expected value of the function $2^{\mathsf{d}_{T^*P}(y)}$ with respect to the distribution $T(x, \cdot)$. Applying Markov's inequality (2.1.4) to this function yields the desired result.
2.3 Infinite sequences
These lecture notes treat the theory of randomness over continuous spaces mostly separately, starting in Section 4.1. But the case of computable measures over infinite sequences is particularly simple and appealing, so we give some of its results here, even if most follow from the more general results given later. In this section, we will fix a finite or countable alphabet $\Sigma = \{s_1, s_2, \ldots\}$, and consider probability distributions over the set
$$X = \Sigma^{\mathbb{N}}$$
of infinite sequences with members in $\Sigma$. An alternative way of speaking of this is to consider a sequence of random variables $X_1, X_2, \ldots$, where $X_i \in \Sigma$, with a joint distribution. The two interesting extreme special cases are $\Sigma = \mathbb{B} = \{0, 1\}$, giving the set of infinite 0-1 sequences, and $\Sigma = \mathbb{N}$, giving the set of sequences of natural numbers.
2.3.1 Null sets

Our goal here is to illustrate the measure theory developed in Section A.2, through Subsection A.2.3, on the concrete example of the set of infinite sequences, and then to develop the theory of randomness in it. We will distinguish a few simple kinds of subsets of the set $X$ of sequences, those for which probability can be defined especially easily.

Definition 2.3.1
• For a string $z \in \Sigma^*$, we will denote by $zX$ the set of elements of $X$ with prefix $z$. Such sets will be called cylinder sets.
• A subset of $X$ is open if it is the union of any number (not necessarily finite) of cylinder sets. It is called closed if its complement is open.
• An open set is called constructive if it is the union of a recursively enumerable set of cylinder sets.
• A set is called $G_\delta$ if it is the intersection of a sequence of open sets. It is called $F_\sigma$ if it is the union of a sequence of closed sets.
• We say that a set $E \subseteq X$ is finitely determined if there is an $n$ and an $\mathcal{E} \subseteq \Sigma^n$ such that
$$E = \mathcal{E}X = \bigcup_{s \in \mathcal{E}} sX.$$
Let $\mathcal{F}$ be the class of all finitely determined subsets of $X$.
• A class of subsets of $X$ is called an algebra if it is closed with respect to finite intersections and complements (and then of course, also with respect to finite unions). It is called a $\sigma$-algebra (sigma-algebra) when it is also closed with respect to countable intersections (and then of course, also with respect to countable unions). y

Example 2.3.2 An example of an open set that is not finitely determined is the set $E$ of all sequences that contain the substring 11. y
The following observations are easy to prove.

Proposition 2.3.3
a) Every open set $G$ can be represented as the union of a sequence of disjoint cylinder sets.
b) The set of open sets is closed with respect to finite intersection and arbitrary union.
c) Each finitely determined set is both open and closed.
d) The class $\mathcal{F}$ of finitely determined sets forms an algebra.

Probability is generally defined as a nonnegative, monotonic function (a "measure", see later) on some subsets of the event space (in our case the set $X$) that has a certain additivity property. We do not immediately need the complete definition of measures, but let us say what we mean by an additive set function.

Definition 2.3.4 Consider a nonnegative function $\mu$ defined over some subsets of $X$ that is also monotonic, that is, $A \subseteq B$ implies $\mu(A) \le \mu(B)$. We say that it is additive if, whenever it is defined on the sets $E_1, \ldots, E_n$ and on $E = \bigcup_{i=1}^{n} E_i$, and the sets $E_i$ are mutually disjoint, we have $\mu(E) = \mu(E_1) + \cdots + \mu(E_n)$.

We say that $\mu$ is countably additive, or $\sigma$-additive (sigma-additive), if in addition, whenever it is defined on the sets $E_1, E_2, \ldots$ and on $E = \bigcup_{i=1}^{\infty} E_i$, and the sets $E_i$ are mutually disjoint, we have $\mu(E) = \sum_{i=1}^{\infty} \mu(E_i)$. y

First we consider measures defined only on cylinder sets.

Definition 2.3.5 (Measure over $\Sigma^{\mathbb{N}}$) Let $\mu : \Sigma^* \to \mathbb{R}_+$ be a function assigning a nonnegative number to each string in $\Sigma^*$. We will call $\mu$ a measure if it satisfies the condition
$$\mu(x) = \sum_{s \in \Sigma} \mu(xs). \qquad (2.3.1)$$
We will take the liberty to also write
$\mu(sX) = \mu(s)$ for all $s \in \Sigma^*$, and $\mu(\emptyset) = 0$. We call $\mu$ a probability measure if $\mu(\Lambda) = 1$ and, correspondingly, $\mu(X) = 1$. A measure is called computable if it is computable as a function $\mu : \Sigma^* \to \mathbb{R}_+$ according to Definition 1.5.2. y

The following observation lets us extend measures.
Lemma 2.3.6 Let $\mu$ be a measure defined as above. If a cylinder set $zX$ is the disjoint union $zX = x_1X \cup x_2X \cup \cdots$ of cylinder sets, then we have $\mu(z) = \sum_i \mu(x_i)$.

Proof. Let us call an arbitrary $y \in \Sigma^*$ good if
$$\mu(zy) = \sum_i \mu(zyX \cap x_iX)$$
holds. We have to show that the empty string $\Lambda$ is good. It is easy to see that $y$ is good if $zyX \subseteq x_iX$ for some $i$. Also, if $ys$ is good for all $s \in \Sigma$ then $y$ is good. Now assume that $\Lambda$ is not good. Then there is also an $s_1 \in \Sigma$ such that $s_1$ is not good. But then there is also an $s_2 \in \Sigma$ such that $s_1 s_2$ is not good. And so on, we obtain an infinite sequence $s_1 s_2 \cdots$ such that $s_1 \cdots s_n$ is not good for any $n$. But then the sequence $z s_1 s_2 \cdots$ is not contained in $\bigcup_i x_iX$, contrary to the assumption.

Corollary 2.3.7 Two disjoint-union representations of the same open set,
$$\bigcup_i x_iX = \bigcup_i y_iX,$$
imply $\sum_i \mu(x_i) = \sum_i \mu(y_i)$.

Proof. The set $\bigcup_i x_iX$ can also be written as
$$\bigcup_i x_iX = \bigcup_{i,j} x_iX \cap y_jX.$$
A set $x_iX \cap y_jX$ is nonempty only if one of the strings $x_i, y_j$ is a continuation of the other. Calling this longer string $z_{ij}$, we have $x_iX \cap y_jX = z_{ij}X$. Now Lemma 2.3.6 is applicable to the union $x_iX = \bigcup_j (x_iX \cap y_jX)$, giving
$$\mu(x_i) = \sum_j \mu(x_iX \cap y_jX),$$
$$\sum_i \mu(x_i) = \sum_{i,j} \mu(x_iX \cap y_jX).$$
By symmetry, the right-hand side is also equal to $\sum_j \mu(y_j)$.

The above corollary allows the following definition.

Definition 2.3.8 For an open set $G$ given as a disjoint union of cylinder sets $x_iX$, let $\mu(G) = \sum_i \mu(x_i)$. For a closed set $F = X \setminus G$ where $G$ is open, let $\mu(F) = \mu(X) - \mu(G)$. y
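For open sets given by finitely many cylinders, Definition 2.3.8 is directly computable. A small Python sketch for the coin-toss measure $\lambda(x) = 2^{-l(x)}$ over the binary alphabet (function names are ours):

```python
def is_prefix_free(strings):
    """Cylinder sets xX are disjoint exactly when no generator string
    is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)

def open_set_measure(cylinders):
    """lambda(G) for an open G given as a disjoint union of cylinder
    sets xX, following Definition 2.3.8, with lambda(x) = 2^-l(x)."""
    assert is_prefix_free(cylinders)
    return sum(2.0 ** -len(x) for x in cylinders)

# The open set of binary sequences beginning with 00 or with 11:
print(open_set_measure(["00", "11"]))   # 0.25 + 0.25 = 0.5
```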
The following is easy to check.

Proposition 2.3.9 The measure as defined on open sets is monotonic and countably additive. In particular, it is countably additive on the algebra of finitely determined sets.

Before extending measures to a wider range of sets, consider an important special case. With infinite sequences, there are certain events that are not impossible, but still have probability 0.

Definition 2.3.10 Given a measure $\mu$, a set $N \subseteq X$ is called a null set with respect to $\mu$ if there is a sequence $G_1, G_2, \ldots$ of open sets with the property $N \subseteq \bigcap_m G_m$ and $\mu(G_m) \le 2^{-m}$. We will say that the sequence $G_m$ witnesses the fact that $N$ is a null set. y

Examples 2.3.11 Let $\Sigma = \{0, 1\}$, and let us define the measure $\lambda$ by $\lambda(x) = 2^{-n}$ for all $n$ and for all $x \in \Sigma^n$.

1. For every infinite sequence $\xi \in X$, the one-element set $\{\xi\}$ is a null set with respect to $\lambda$. Indeed, for each natural number $n$, let $H_n = \{\eta \in X : \eta(0) = \xi(0), \eta(1) = \xi(1), \ldots, \eta(n) = \xi(n)\}$. Then $\lambda(H_n) = 2^{-n-1}$, and $\{\xi\} = \bigcap_n H_n$.

2. For a less trivial example, consider the set $E$ of those elements $t \in X$ that are 1 in each positive even position, that is
$$E = \{t \in X : t(2) = 1, t(4) = 1, t(6) = 1, \ldots\}.$$
Then $E$ is a null set. Indeed, for each natural number $n$, let $G_n = \{t \in X : t(2) = 1, t(4) = 1, \ldots, t(2n) = 1\}$. This helps express $E$ as $E = \bigcap_n G_n$, where $\lambda(G_n) = 2^{-n}$. y
Proposition 2.3.12 Let $N_1, N_2, \ldots$ be a sequence of null sets with respect to a measure $\mu$. Their union $N = \bigcup_i N_i$ is also a null set.

Proof. Let $G_{i,1}, G_{i,2}, \ldots$ be the infinite sequence witnessing the fact that $N_i$ is a null set. Let $H_m = \bigcup_{i=1}^{\infty} G_{i, m+i}$. Then the sequence $H_m$ of open sets witnesses the fact that $N$ is a null set.

We would like to extend our measure to null sets $N$ and say $\mu(N) = 0$. The proposition we have just proved shows that such an extension would be countably additive on the null sets. But we do not need this extension for the moment,
so we postpone it. Still, given a probability measure $P$ over $X$, it becomes meaningful to say that a certain property holds with probability 1. What this means is that the set of those sequences that do not have this property is a null set.

Following the idea of Martin-Löf, we would like to call a sequence nonrandom if it is contained in some simple null set; that is, if it has some easily definable property with probability 0. As we have seen in part 1 of Example 2.3.11, it is important to insist on simplicity, otherwise (say with respect to the measure $\lambda$) every sequence might be contained in a null set, namely the one-element set consisting of itself. But most of these sets are not defined simply at all. An example of a simply defined null set is given in part 2 of Example 2.3.11. These reflections justify the following definition, in which "simple" is specified as "constructive".

Definition 2.3.13 Let $\mu$ be a computable measure. A set $N \subseteq X$ is called a constructive null set if there is a recursively enumerable set $\Gamma \subseteq \mathbb{N} \times \Sigma^*$ with the property that, denoting $\Gamma_m = \{x : (m, x) \in \Gamma\}$ and $G_m = \bigcup_{x \in \Gamma_m} xX$, we have $N \subseteq \bigcap_m G_m$ and $\mu(G_m) \le 2^{-m}$. y

In words, the difference between the definition of null sets and constructive null sets is that the sets $G_m = \bigcup_{x \in \Gamma_m} xX$ are here required to be constructive open, moreover in such a way that from $m$ one can compute the program generating $G_m$. In even looser words, a set is a constructive null set if for any $\varepsilon > 0$ one can effectively construct a union of cylinder sets containing it, with total measure $\le \varepsilon$.

Now we are in a position to define random infinite sequences.

Definition 2.3.14 (Random sequence) An infinite sequence $\xi$ is random with respect to a probability measure $P$ if and only if $\xi$ is not contained in any constructive null set with respect to $P$. y

It is easy to see that the set of nonrandom sequences is a null set. Indeed, there is only a countable number of constructive null sets, so even their union is a null set. The following theorem strengthens this observation significantly.

Theorem 2.3.1 Let us fix a computable probability measure $P$. The set of all nonrandom sequences is a constructive null set.

Thus, there is a universal constructive null set, containing all other constructive null sets. A sequence is random when it does not belong to this set.
Proof of Theorem 2.3.1. The proof uses the projection technique that has appeared several times in this book, for example in proving the existence of a universal lower semicomputable semimeasure.
We know that it is possible to list all recursively enumerable subsets of the set $\mathbb{N} \times \Sigma^*$ into a sequence: there is a recursively enumerable set $\Delta \subseteq \mathbb{N}^2 \times \Sigma^*$ with the property that for every recursively enumerable set $\Gamma \subseteq \mathbb{N} \times \Sigma^*$ there is an $e$ with $\Gamma = \{(m, x) : (e, m, x) \in \Delta\}$. We will write
$$\Delta_{e,m} = \{x : (e, m, x) \in \Delta\}, \qquad D_{e,m} = \bigcup_{x \in \Delta_{e,m}} xX.$$
We transform the set $\Delta$ into another set $\Delta'$ with the following properties:
• For each $e, m$ we have $\sum_{(e,m,x) \in \Delta'} \mu(x) \le 2^{-m+1}$.
• If for some $e$ we have $\sum_{(e,m,x) \in \Delta} \mu(x) \le 2^{-m}$ for all $m$, then for all $m$ we have $\{x : (e, m, x) \in \Delta'\} = \{x : (e, m, x) \in \Delta\}$.
This transformation is routine, so we leave it to the reader. By the construction of $\Delta'$, for every constructive null set $N$ there is an $e$ with
$$N \subseteq \bigcap_m D'_{e,m}.$$
Define the recursively enumerable set
$$\hat\Gamma = \{(m, x) : \exists e\; (e, m + e + 2, x) \in \Delta'\}.$$
Then $\hat G_m = \bigcup_e D'_{e, m+e+2}$. For all $m$ we have
$$\sum_{x \in \hat\Gamma_m} \mu(x) = \sum_e \sum_{(e, m+e+2, x) \in \Delta'} \mu(x) \le \sum_e 2^{-m-e-1} = 2^{-m}.$$
This shows that $\hat\Gamma$ defines a constructive null set. Let $\Gamma$ be any other recursively enumerable subset of $\mathbb{N} \times \Sigma^*$ that defines a constructive null set. Then there is an $e$ such that for all $m$ we have $G_m = D'_{e,m}$. The universality follows now from
$$\bigcap_m G_m = \bigcap_m D'_{e,m} \subseteq \bigcap_m D'_{e, m+e+2} \subseteq \bigcap_m \hat G_m.$$
2.3.2 Probability space

Now we are ready to extend the measure to a much wider range of subsets of $X$.
Definition 2.3.15 Elements of the smallest $\sigma$-algebra $\mathcal{A}$ containing the cylinder sets of $X$ are called the Borel sets of $X$. The pair $(X, \mathcal{A})$ is an example of a measureable space, where the Borel sets are called the measureable sets.

A nonnegative sigma-additive function $\mu$ over $\mathcal{A}$ is called a measure. It is called a probability measure if $\mu(X) = 1$. If $\mu$ is fixed then the triple $(X, \mathcal{A}, \mu)$ is called a measure space. If $\mu$ is a probability measure then the space is called a probability space. y

In the Appendix we cited a central theorem of measure theory, Caratheodory's extension theorem. It implies that if a measure is defined on an algebra $\mathcal{L}$ in a sigma-additive way, then it can be extended uniquely to the $\sigma$-algebra generated by $\mathcal{L}$, that is, the smallest $\sigma$-algebra containing $\mathcal{L}$. We defined a measure $\mu$ over $X$ as a nonnegative function $\mu : \Sigma^* \to \mathbb{R}_+$ satisfying the equality (2.3.1). Then we defined $\mu(xX) = \mu(x)$, and further extended $\mu$ in a $\sigma$-additive way to all elements of the algebra $\mathcal{F}$. Now Caratheodory's theorem allows us to extend it uniquely to all Borel sets, and thus to obtain a measure space $(X, \mathcal{A}, \mu)$. Of course, all null sets in $\mathcal{A}$ get measure 0.

Open, closed and measureable sets can also be defined in the set of real numbers.

Definition 2.3.16 A subset $G \subseteq \mathbb{R}$ is open if it is the union of a set of open intervals $(a_i, b_i)$. It is closed if its complement is open. The set $\mathcal{B}$ of Borel sets of $\mathbb{R}$ is defined as the smallest $\sigma$-algebra containing all open sets. y

The pair $(\mathbb{R}, \mathcal{B})$ is another example of a measureable space.

Definition 2.3.17 (Lebesgue measure) Consider the left-closed intervals of the line (including intervals of the form $(-\infty, a)$). Let $\mathcal{L}$ be the set of finite disjoint unions of such intervals. This is an algebra. We define the function $\lambda$ over $\mathcal{L}$ as follows: $\lambda(\bigcup_i [a_i, b_i)) = \sum_i (b_i - a_i)$. It is easy to check that this is a $\sigma$-additive measure and therefore by Caratheodory's theorem can be extended to the set $\mathcal{B}$ of all Borel sets. This function is called the Lebesgue measure over $\mathbb{R}$, giving us the measureable space $(\mathbb{R}, \mathcal{B}, \lambda)$. y

Finally, we can define the notion of a measureable function over $X$.

Definition 2.3.18 (Measureable functions) A function $f : X \to \mathbb{R}$ is called measureable if and only if $f^{-1}(E) \in \mathcal{A}$ for all $E \in \mathcal{B}$. y

The following is easy to prove.

Proposition 2.3.19 Function $f : X \to \mathbb{R}$ is measureable if and only if all sets of
2.3.4 Integral

The definition of the integral over a measure space is given in the Appendix, in Subsection A.2.3. Here, we give an exposition specialized to infinite sequences.

Definition 2.3.25 A measurable function $f : X \to \mathbb{R}$ is called a step function if its range is finite. The set of step functions will be called $\mathcal{E}$.
Given a step function $f$ which takes values $x_i$ on sets $A_i$, and a finite measure $\mu$, we define
$$\mu(f) = \mu f = \int f \, d\mu = \int f(x)\, \mu(dx) = \sum_i x_i\, \mu(A_i).$$
y

Proposition A.2.14, when specialized to our situation here, says the following.

Proposition 2.3.26 The functional $\mu$ defined above on step functions can be extended to the set $\mathcal{E}_+$ of monotonic limits of nonnegative elements of $\mathcal{E}$, by continuity. The set $\mathcal{E}_+$ is the set of all nonnegative measurable functions.

Now we extend the notion of integral to a wider class of functions.

Definition 2.3.27 A measurable function $f$ is called integrable with respect to a finite measure $\mu$ if $\mu|f|^+ < \infty$ and $\mu|f|^- < \infty$. In this case, we define $\mu f = \mu|f|^+ - \mu|f|^-$. y

It is easy to see that the mapping $f \mapsto \mu f$ is linear when $f$ runs through the set of measureable functions with $\mu|f| < \infty$. The following is also easy to check.

Proposition 2.3.28 Let $\mu$ be a computable measure.
a) If $f$ is computable then a program to compute $\mu f$ can be found from the program to compute $f$.
b) If $f$ is lower semicomputable then a program to lower semicompute $\mu f$ can be found from a program to lower semicompute $f$.
2.3.5 Randomness tests

We can now define randomness tests similarly to Section 2.2.

Definition 2.3.29 A function $d : X \to \mathbb{R}$ is an integrable test, or expectation-bounded test, of randomness with respect to the probability measure $P$ if it is lower semicomputable and satisfies the condition
$$\int 2^{d(x)}\, P(dx) \le 1. \qquad (2.3.2)$$
It is called a Martin-Löf test, or probability-bounded test, if instead of the latter condition only the weaker one is satisfied, saying that $P(d(x) > m) < 2^{-m}$ for each positive integer $m$.
It is universal if it dominates all other integrable tests to within an additive constant. Universal Martin-Löf tests are defined in the same way. y

Randomness tests and constructive null sets are closely related.

Theorem 2.3.2 Let $P$ be a computable probability measure over the set $X = \Sigma^{\mathbb{N}}$. There is a correspondence between constructive null sets and randomness tests:
a) For every randomness test $d$, the set $N = \{\xi : d(\xi) = \infty\}$ is a constructive null set.
b) For every constructive null set $N$ there is a randomness test $d$ with $N = \{\xi : d(\xi) = \infty\}$.

Proof. Let $d$ be a randomness test: then for each $k$ the set $G_k = \{\xi : d(\xi) > k\}$ is a constructive open set, and by Markov's inequality (which is proved in the continuous case just as in the discrete case) we have $P(G_k) \le 2^{-k}$. The sets $G_k$ witness that $N$ is a constructive null set.

Let $N$ be a constructive null set with $N \subseteq \bigcap_{k=1}^{\infty} G_k$, where $G_k$ is a uniform sequence of constructive open sets with $P(G_k) \le 2^{-k}$. Without loss of generality assume that the sequence $G_k$ is decreasing. Then the function $d(\xi) = \sup\{k : \xi \in G_k\}$ is lower semicomputable and satisfies $P\{\xi : d(\xi) \ge k\} \le 2^{-k}$, so it is a Martin-Löf test. Just as in Proposition 2.2.3, it is easy to check that $d(x) - 2\log d(x) - c$ is an integrable test for some constant $c$.

Just as there is a universal constructive null set, there are universal randomness tests.

Theorem 2.3.3 For a computable measure $P$, there is a universal integrable test $\mathsf{d}_P(\xi)$. More exactly, the function $\xi \mapsto \mathsf{d}_P(\xi)$ is lower semicomputable, satisfies (2.3.2), and for all integrable tests $d(\xi)$ for $P$, we have
$$d(\xi) \stackrel{+}{<} \mathbf{d}_P(\xi) + K(d) + K(P).$$
The proof is similar to the proof of Theorem 1.6.3 on the existence of a universal constructive semimeasure. It uses the projection technique and a weighted sum.
2.3.6 Randomness and complexity

We hope for a result connecting complexity with randomness similar to Theorem 2.2.1. Somewhat surprisingly, there is indeed such a result. This is a surprise since in an infinite sequence, arbitrarily large oscillations of any quantity depending on the prefixes are actually to be expected.
Theorem 2.3.4 Let $X = \Sigma^{\mathbb N}$ be the set of infinite sequences. For all computable measures 휇 over 푋, we have
$$\mathbf{d}_\mu(\xi) \stackrel{+}{=} \sup_n\,(-\log\mu(\xi_{1:n}) - K(\xi_{1:n})). \tag{2.3.3}$$
Here, the constant in $\stackrel{+}{=}$ depends on the computable measure 휇.

Corollary 2.3.30 Let 휆 be the uniform distribution over the set of infinite binary sequences. Then
$$\mathbf{d}_\lambda(\xi) \stackrel{+}{=} \sup_n\,(n - K(\xi_{1:n})). \tag{2.3.4}$$
In other words, an infinite binary sequence is random (with respect to the uniform distribution) if and only if the complexity $K(\xi_{1:n})$ of its initial segments of length 푛 never decreases below 푛 by more than an additive constant.
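For example, for the all-zero sequence $\xi = 000\cdots$ we have $K(\xi_{1:n}) \stackrel{+}{<} K(n) \stackrel{+}{<} 2\log n$, so $n - K(\xi_{1:n}) \to \infty$ and $\mathbf{d}_\lambda(\xi) = \infty$: the all-zero sequence is not random with respect to 휆.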
Proof. To prove $\stackrel{+}{>}$, define the function
$$f_\mu(\xi) = \sum_s 1_{sX}(\xi)\,\frac{\mathbf{m}(s\mid\mu)}{\mu(s)} = \sum_n \frac{\mathbf{m}(\xi_{1:n}\mid\mu)}{\mu(\xi_{1:n})} \ge \sup_n \frac{\mathbf{m}(\xi_{1:n}\mid\mu)}{\mu(\xi_{1:n})}.$$
The function $\xi \mapsto f_\mu(\xi)$ is lower semicomputable with $\mu^\xi f_\mu(\xi) \le 1$, and hence
$$\mathbf{d}_\mu(\xi) \stackrel{+}{>} \log f_\mu(\xi) \stackrel{+}{>} \sup_n\,(-\log\mu(\xi_{1:n}) - K(\xi_{1:n}\mid\mu)).$$
The proof of $\stackrel{+}{<}$ reproduces the proof of Theorem 5.2 of [18]. Let us abbreviate: $1_y(\xi) = 1_{yX}(\xi)$.

Since $\lceil d(\xi)\rceil$ only takes integer values and is lower semicomputable, there are computable sequences $y_i \in \Sigma^*$ and $k_i \in \mathbb{N}$ with
$$2^{\lceil d(\xi)\rceil} = \sup_i 2^{k_i}\,1_{y_i}(\xi) \ge \frac{1}{2}\sum_i 2^{k_i}\,1_{y_i}(\xi),$$
with the property that if $i < j$ and $1_{y_i}(\xi) = 1_{y_j}(\xi) = 1$ then $k_i < k_j$. The inequality follows from the fact that for any finite sequence $n_1 < n_2 < \cdots$ we have $\sum_j 2^{n_j} \le 2\max_j 2^{n_j}$. The function $\gamma(y) = \sum_{y_i = y} 2^{k_i}$ is lower semicomputable. With it, we have
$$\sum_i 2^{k_i}\,1_{y_i}(\xi) = \sum_{y\in\Sigma^*} 1_y(\xi)\,\gamma(y).$$
Since $\mu\,2^{\lceil d\rceil} \le 2$, we have $\sum_y \mu(y)\gamma(y) \le 2$, hence $\mu(y)\gamma(y) \stackrel{*}{<} \mathbf{m}(y)$, that is $\gamma(y) \stackrel{*}{<} \mathbf{m}(y)/\mu(y)$. It follows that
$$2^{d(\xi)} \stackrel{*}{<} \sup_{y\in\Sigma^*} 1_y(\xi)\,\frac{\mathbf{m}(y)}{\mu(y)} = \sup_n \frac{\mathbf{m}(\xi_{1:n})}{\mu(\xi_{1:n})}.$$
Taking logarithms, we obtain the $\stackrel{+}{<}$ part of the theorem.
2.3.7 Universal semimeasure, algorithmic probability

Universal semimeasures exist over infinite sequences just as in the discrete space.

Definition 2.3.31 A function $\mu : \Sigma^* \to \mathbb{R}_+$ is a semimeasure if it satisfies
$$\mu(x) \ge \sum_{s\in\Sigma}\mu(xs), \qquad \mu(\Lambda) \le 1.$$
If it is lower semicomputable we will call the semimeasure constructive. y

These requirements imply, just as for measures, the following generalization to an arbitrary prefix-free set $A \subseteq \Sigma^*$:
$$\sum_{x\in A}\mu(x) \le 1.$$
Standard technique gives the following:

Theorem 2.3.5 (Universal semimeasure) There is a universal constructive semimeasure 휇, that is, a constructive semimeasure with the property that for every other constructive semimeasure 휈 there is a constant $c_\nu > 0$ with $\mu(x) \ge c_\nu\,\nu(x)$.

Definition 2.3.32 Let us fix a universal constructive semimeasure over 푋 and denote it by 푀(푥). y

There is a graphical way to represent a universal semimeasure, via monotonic Turing machines, which can be viewed as a generalization of self-delimiting machines.

Definition 2.3.33 A Turing machine T is monotonic if it has no input tape or output tape, but can ask repeatedly for input symbols and can emit repeatedly output symbols. The input is assumed to be an infinite binary string, but it is not assumed that T will ask for all of its symbols, even in infinite time. If the finite or infinite input string is $p = p_1p_2\cdots$, the (finite or infinite) output string is written as $T(p) \in \Sigma^* \cup \Sigma^{\mathbb N}$. y
Just as we can generalize self-delimiting machines to obtain the notion of monotonic machines, we can generalize algorithmic probability, defined for a discrete space, to a version defined for infinite sequences. Imagine the monotonic Turing machine T computing the function $T(p)$ as in the definition, and assume that its input symbols come from coin-tossing.

Definition 2.3.34 (Algorithmic probability over infinite sequences) For a string $x \in \Sigma^*$, let $P_T(x)$ be the probability of $T(\pi) \sqsupseteq x$, when $\pi = \pi_1\pi_2\cdots$ is the infinite coin-tossing sequence. y

We can compute
$$P_T(x) = \sum_{T(p)\,\sqsupseteq\,x,\ p\ \text{minimal}} 2^{-l(p)},$$
where "minimal" means that no proper prefix $p'$ of 푝 gives $T(p') \sqsupseteq x$. This expression shows that $P_T(x)$ is lower semicomputable. It is also easy to check that it is a semimeasure, so we have a constructive semimeasure. The following theorem is obtained using a standard construction:

Theorem 2.3.6 Every constructive semimeasure 휇(푥) can be represented as $\mu(x) = P_T(x)$ with the help of an appropriate monotonic machine T.

This theorem justifies the following.

Definition 2.3.35 From now on we will call the universal semimeasure 푀(푥) also the algorithmic probability. A monotonic Turing machine T giving $P_T(x) = M(x)$ will be called an optimal machine. y

We will see that $-\log M(x)$ should also be considered a kind of description complexity:

Definition 2.3.36 Denote $KM(x) = -\log M(x)$. y

How does $KM(x)$ compare to $K(x)$?

Proposition 2.3.37 We have the bounds
$$KM(x) \stackrel{+}{<} K(x) \stackrel{+}{<} KM(x) + K(l(x)).$$
Proof. To prove $KM(x) \stackrel{+}{<} K(x)$, define the lower semicomputable function
$$\mu(x) = \sup_S \sum_{y\in S} \mathbf{m}(xy),$$
where 푆 runs over the finite prefix-free subsets of $\Sigma^*$.
By its definition this is a constructive semimeasure and it is $\ge \mathbf{m}(x)$. This implies $M(x) \stackrel{*}{>} \mathbf{m}(x)$, and thus $KM(x) \stackrel{+}{<} K(x)$.

For the other direction, define the following lower semicomputable function 휈 over $\Sigma^*$:
휈(푥) = m(푙(푥))· 푀(푥).
To show that it is a semimeasure, compute:
$$\sum_{x\in\Sigma^*}\nu(x) = \sum_n \mathbf{m}(n)\sum_{x\in\Sigma^n} M(x) \le \sum_n \mathbf{m}(n) \le 1.$$
It follows that $\mathbf{m}(x) \stackrel{*}{>} \nu(x)$ and hence $K(x) \stackrel{+}{<} KM(x) + K(l(x))$.
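As an illustration of Definition 2.3.34 (added here, not in the original; the identity machine is a made-up toy for which $P_T(x) = 2^{-l(x)}$ exactly), algorithmic probability can be estimated by simulating the coin-tossing input:

```python
# Monte Carlo estimate of P_T(x) for a toy monotonic machine T that
# simply copies its input bits to its output.

import random

def identity_machine(coin_flips):
    """Monotonic machine: asks for input bits and emits them unchanged."""
    for bit in coin_flips:
        yield bit

def estimate_P_T(x, trials=100_000):
    hits = 0
    for _ in range(trials):
        flips = (random.randint(0, 1) for _ in range(len(x)))
        out = "".join(str(b) for b in identity_machine(flips))
        if out.startswith(x):        # the output T(pi) extends x
            hits += 1
    return hits / trials

print(estimate_P_T("101"))  # close to 2**-3 = 0.125
```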
2.3.8 Randomness via algorithmic probability

Fix a computable measure 푃(푥) over the space 푋. Just as in the discrete case, the universal semimeasure gives rise to a Martin-Löf test.

Definition 2.3.38 Denote
$$\mathbf{d}^0_P(\xi) = \log\sup_n \frac{M(\xi_{1:n})}{P(\xi_{1:n})}.$$
y

By Proposition 2.3.37, this function is $\stackrel{*}{>} \mathbf{d}_P(\xi)$, defined in Theorem 2.3.3. It can be shown that it is no longer expectation-bounded. But the theorem below shows that it is still a Martin-Löf (probability-bounded) test, so it defines the same infinite random sequences.

Theorem 2.3.7 The function $\mathbf{d}^0_P(\xi)$ is a Martin-Löf test.

Proof. Lower semicomputability follows from the form of the definition. Let $G_m = \{\xi : \mathbf{d}^0_P(\xi) > m\}$. By standard methods, produce a list of finite strings $y_1, y_2, \ldots$ with:
a) $M(y_i)/P(y_i) > 2^m$.
b) The cylinder sets $y_iX$ cover $G_m$ and are disjoint (thus the set $\{y_1, y_2, \ldots\}$ is prefix-free).

We have
$$P(G_m) = \sum_i P(y_i) < 2^{-m}\sum_i M(y_i) \le 2^{-m}.$$
Just as in the discrete case, the universality of 푀(푥) implies
$$KM(x) \stackrel{+}{<} -\log P(x) + K(P).$$
On the other hand,
$$\sup_n\,(-\log P(\xi_{1:n}) - KM(\xi_{1:n}))$$
is a randomness test. So for random 휉, the value $KM(\xi_{1:n})$ remains within a constant of $-\log P(\xi_{1:n})$: it does not oscillate much.
3 Information
3.1 Information-theoretic relations
In this subsection, we use some material from [23].
3.1.1 The information-theoretic identity

Information-theoretic entropy is related to complexity in both a formal and a more substantial way.
Classical entropy and classical coding

Entropy is used in information theory in order to characterize the compressibility of a statistical source.

Definition 3.1.1 The entropy of a discrete probability distribution 푃 is defined as
$$H(P) = -\sum_x P(x)\log P(x).$$
y

A simple argument shows that this quantity is non-negative. It is instructive to recall a more general inequality, taken from [13], implying it:

Theorem 3.1.1 Let $a_i, b_i > 0$, $a = \sum_i a_i$, $b = \sum_i b_i$. Then we have
$$\sum_i a_i \log\frac{a_i}{b_i} \ge a\log\frac{a}{b}, \tag{3.1.1}$$
with equality only if $a_i/b_i$ is constant. In particular, if $\sum_i a_i = 1$ and $\sum_i b_i \le 1$ then we have $\sum_i a_i \log\frac{a_i}{b_i} \ge 0$.

Proof. Exercise, an application of Jensen's inequality to the concave function $\log x$.
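As a quick numeric sanity check (an added illustration, not part of the original text), the following sketch evaluates the gap between the two sides of (3.1.1):

```python
# Check the log-sum inequality: sum_i a_i log(a_i/b_i) >= a log(a/b),
# with equality iff a_i/b_i is constant.

from math import log2

def log_sum_gap(a_list, b_list):
    a, b = sum(a_list), sum(b_list)
    lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a_list, b_list))
    rhs = a * log2(a / b)
    return lhs - rhs  # always >= 0

print(log_sum_gap([0.5, 0.5], [0.25, 0.75]))  # > 0
print(log_sum_gap([0.2, 0.4], [0.1, 0.2]))    # 0: a_i/b_i is constant
```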
If 푃 is a probability distribution and $Q(x) \ge 0$, $\sum_x Q(x) \le 1$, then this inequality implies
$$H(P) \le -\sum_x P(x)\log Q(x).$$
We can interpret this inequality as follows. We have seen in Subsection 1.6.3 that if $x \mapsto z_x$ is a prefix code, mapping the objects 푥 into binary strings $z_x$ that are not prefixes of each other, then $\sum_x 2^{-l(z_x)} \le 1$. Thus, substituting $Q(x) = 2^{-l(z_x)}$ above we obtain Shannon's result:
Theorem 3.1.2 (Shannon) If $x \mapsto z_x$ is a prefix code then for its expected codeword length we have the lower bound
$$\sum_x P(x)\,l(z_x) \ge H(P).$$
Entropy as expected complexity
Applying Shannon's Theorem 3.1.2 to the code obtained by taking $z_x$ as the shortest self-delimiting description of 푥, we obtain the inequality
$$\sum_x P(x)K(x) \ge H(P).$$
On the other hand, since $\mathbf{m}(x)$ is a universal semimeasure, we have $\mathbf{m}(x) \stackrel{*}{>} P(x)$: more precisely, $K(x) \stackrel{+}{<} -\log P(x) + K(P)$, leading to the following theorem.

Theorem 3.1.3 For a computable distribution 푃 we have
$$H(P) \le \sum_x P(x)K(x) \stackrel{+}{<} H(P) + K(P). \tag{3.1.2}$$
Note that $H(P)$ is the entropy of the distribution 푃 while $K(P)$ is just the description complexity of the function $x \mapsto P(x)$; it has nothing to do directly with the magnitudes of the values $P(x)$ for each 푥 as real numbers. These two relations give
$$H(P) \stackrel{+}{=} \sum_x P(x)K(x),$$
up to an additive term of order $K(P)$: the entropy is within an additive constant equal to the expected complexity. Our intended interpretation of $K(x)$ as information content of the individual object 푥 is thus supported by a tight quantitative relationship to Shannon's statistical concept.
Identities, inequalities

The statistical entropy obeys meaningful identities which immediately follow from the definition of conditional entropy.

Definition 3.1.2 Let 푋 and 푌 be two discrete random variables with a joint distribution. The conditional entropy of 푌 with respect to 푋 is defined as
$$H(Y\mid X) = -\sum_x P[X=x]\sum_y P[Y=y\mid X=x]\log P[Y=y\mid X=x].$$
The joint entropy of 푋 and 푌 is defined as the entropy of the pair $(X, Y)$. y

The following additivity property is then verifiable by simple calculation:
H(푋, 푌) = H(푋) + H(푌 | 푋). (3.1.3)
Its meaning is that the amount of information in the pair $(X,Y)$ is equal to the amount of information in 푋 plus the amount of our residual uncertainty about 푌 once the value of 푋 is known. Though intuitively expected, the following inequality needs proof:
H(푌 | 푋) ≤ H(푌).
We will prove it after we define the difference below. Definition 3.1.3 The information in 푋 about 푌 is defined by
I(푋 : 푌) = H(푌) − H(푌 | 푋).
y This is the amount by which our knowledge of 푋 decreases our uncertainty about 푌. The identity below comes from equation (3.1.3).
I(푋 : 푌) = I(푌 : 푋) = H(푋) + H(푌) − H(푋, 푌)
The meaning of the symmetry relation is that the quantity of information in 푌 about 푋 is equal to the quantity of information in 푋 about 푌. Despite its simple derivation, this fact is not very intuitive; especially since only the quantities are equal, the actual contents are generally not (see [22]). We also call $I(X:Y)$ the mutual information of 푋 and 푌. We can also write
$$I(X:Y) = \sum_{x,y} P[X=x, Y=y]\log\frac{P[X=x, Y=y]}{P[X=x]\,P[Y=y]}. \tag{3.1.4}$$
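These classical quantities are easy to check numerically. The sketch below (an added illustration; the joint distribution is a made-up example) verifies the additivity identity (3.1.3) and the symmetric form of the mutual information:

```python
# Entropy, conditional entropy and mutual information of a joint
# distribution P[X=x, Y=y], stored as {(x, y): probability}.

from math import log2

P = {('a', 0): 0.3, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.4}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

PX, PY = {}, {}
for (x, y), p in P.items():
    PX[x] = PX.get(x, 0) + p
    PY[y] = PY.get(y, 0) + p

H_XY, H_X, H_Y = H(P), H(PX), H(PY)
H_Y_given_X = H_XY - H_X          # additivity (3.1.3)
I_XY = H_Y - H_Y_given_X          # information in X about Y

# symmetry: I(X:Y) = H(X) + H(Y) - H(X,Y)
assert abs(I_XY - (H_X + H_Y - H_XY)) < 1e-12
print(round(H_XY, 4), round(I_XY, 4))
```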
Proposition 3.1.4 The information I(푋 : 푌) is nonnegative and is 0 only if 푋, 푌 are independent.
Proof. Apply Theorem 3.1.1 to (3.1.4). This is 0 only if 푃[푋 = 푥, 푌 = 푦] = 푃[푋 = 푥]푃[푌 = 푦] for all 푥, 푦.
The algorithmic addition theorem

For the notion of algorithmic entropy $K(x)$ defined in Section 1.6, it is a serious test of fitness to ask whether it has an additivity property analogous to the identity (3.1.3). Such an identity would express some deeper relationship than the trivial identity (3.1.3), since the conditional complexity is defined using the notion of conditional computation, and not by an algebraic formula involving the unconditional definition. A literal translation of the identity (3.1.3) turns out to be true for the function $K(x)$ as well as for Kolmogorov's complexity $C(x)$ with good approximation. But the exact additivity property of algorithmic entropy is subtler.

Theorem 3.1.4 (Addition) We have
퐾(푥, 푦) =+ 퐾(푥) + 퐾( 푦 | 푥, 퐾(푥)). (3.1.5)
Proof. We first prove $\stackrel{+}{<}$. Let 퐹 be the following s.d. interpreter. The machine computing $F(p)$ tries to parse 푝 into $p = uv$ so that $T(u)$ is defined, then outputs the pair $(T(u), T(v, \langle T(u), l(u)\rangle))$. If 푢 is a shortest description of 푥 and 푣 a shortest description of 푦 given the pair $(x, K(x))$ then the output of 퐹 is the pair $(x, y)$.

Now we prove $\stackrel{+}{>}$. Let 푐 be a constant (existing in force of Lemma 1.7.3) such that for all 푥 we have $\sum_y 2^{-K(x,y)} \le 2^{-K(x)+c}$. Let us define the semicomputable function $f(z,y)$ by $f((x,k),y) = 2^{k-K(x,y)-c}$. Then for $k = K(x)$ we have $\sum_y f(z,y) \le 1$, hence the generalization (1.6.13) of the coding theorem implies $K(y\mid x, K(x)) \stackrel{+}{<} K(x,y) - K(x)$.

Recall the definition of $x^*$ as the first shortest description of 푥.

Corollary 3.1.5 We have
퐾(푥∗, 푦) =+ 퐾(푥, 푦). (3.1.6)
The Addition Theorem implies
$$K(x,y) \stackrel{+}{<} K(x) + K(y\mid x) \stackrel{+}{<} K(x) + K(y),$$
while these relations do not hold for Kolmogorov's complexity $C(x)$, as pointed out in the discussion of Theorem 1.4.2. For an ordinary interpreter, extra information is needed to separate the descriptions of 푥 and 푦. The self-delimiting interpreter accepts only a description that provides the necessary information to determine its end.

The term $K(x)$ cannot be omitted from the condition in $K(y\mid x, K(x))$ in the identity (3.1.5). Indeed, it follows from equation (1.7.2) that with $y = K(x)$, we have $K(x,y) - K(x) \stackrel{+}{=} 0$. But $K(K(x)\mid x)$ can be quite large, as shown in the following theorem, given here without proof.

Theorem 3.1.5 (see [17]) There is a constant 푐 such that for all 푛 there is a binary string 푥 of length 푛 with
퐾(퐾(푥) | 푥) ≥ log 푛 − log log 푛 − 푐.
We will prove this theorem in Section 3.3. For the study of information relations, let us introduce yet another notation.

Definition 3.1.6 For any functions 푓, 푔, ℎ with values in S, let
$$f \preceq g \pmod{h}$$
mean that there is a constant 푐 such that $K(f(x)\mid g(x), h(x)) \le c$ for all 푥. We will omit $\bmod\ h$ if ℎ is not in the condition. The sign $\asymp$ will mean inequality in both directions. y

The meaning of $x \asymp y$ is that the objects 푥 and 푦 hold essentially the same information. Theorem 1.4.1 implies that if $f \asymp g$ then we can replace 푓 by 푔 wherever it occurs as an argument of the functions 퐻 or 퐾. As an illustration of this definition, let us prove the relation
$$x^* \asymp \langle x, K(x)\rangle,$$
which implies $K(y\mid x, K(x)) \stackrel{+}{=} K(y\mid x^*)$, and hence
퐾(푥, 푦) − 퐾(푥) =+ 퐾( 푦 | 푥∗). (3.1.7)
One direction holds because we can compute $x = T(x^*)$ and $K(x) = l(x^*)$ from $x^*$. On the other hand, if we have 푥 and 퐾(푥) then we can enumerate the
set $E(x)$ of all shortest descriptions of 푥. By the corollary of Theorem 1.7.1, the number of elements of this set is bounded by a constant. The estimate (1.6.4) implies $K(x^*\mid x, K(x)) \stackrel{+}{=} 0$.

Though the string $x^*$ and the pair $(x, K(x))$ contain the same information, they are not equivalent under some stronger criteria on the decoder. To obtain 푥 from the shortest description may take extremely long time. But even if this time 푡 is acceptable, to find any description of length $K(x)$ for 푥 might require about $t\,2^{K(x)}$ steps of search.

The Addition Theorem can be made formally analogous to its information-theoretic counterpart using Chaitin's notational trick.

Definition 3.1.7 Let $K^*(y\mid x) = K(y\mid x^*)$. y

Of course, $K^*(x) = K(x)$. Now we can formulate the relation (3.1.7) as
퐾∗(푥, 푦) =+ 퐾∗(푥) + 퐾∗( 푦 | 푥).
However, we prefer to use the function $K(y\mid x)$ since the function $K^*(y\mid x)$ is not semicomputable. Indeed, if there were a function $h(x,y)$ upper semicomputable in $\langle x,y\rangle$ such that $K(x,y) - K(x) \stackrel{+}{=} h(x,y)$, then by the argument of the proof of the Addition Theorem we would have $K(y\mid x) \stackrel{+}{<} h(x,y)$, which is not true, according to the remark after the proof of the Addition Theorem.

Definition 3.1.8 We define the algorithmic mutual information with the same formula as classical information:
퐼(푥 : 푦) = 퐾( 푦) − 퐾( 푦 | 푥). y Since the quantity 퐼(푥 : 푦) is not quite equal to 퐾(푥) − 퐾(푥 | 푦) (in fact, Levin showed (see [17]) that they are not even asymptotically equal), we prefer to define information also by a symmetrical formula. Definition 3.1.9 The mutual information of two objects 푥 and 푦 is
퐼∗(푥 : 푦) = 퐾(푥) + 퐾( 푦) − 퐾(푥, 푦). y The Addition Theorem implies the identity
퐼∗(푥 : 푦) =+ 퐼(푥∗ : 푦) =+ 퐼( 푦∗ : 푥).
Let us show that classical information is indeed the expected value of algorithmic information, for a computable probability distribution 푝:
Lemma 3.1.10 Given a computable joint probability mass distribution 푝(푥, 푦) over (푥, 푦) we have
$$I(X:Y) - K(p) \stackrel{+}{<} \sum_x\sum_y p(x,y)\,I^*(x:y) \stackrel{+}{<} I(X:Y) + 2K(p), \tag{3.1.8}$$
where $K(p)$ is the length of the shortest prefix-free program that computes $p(x,y)$ from input $(x,y)$.

Proof. We have
$$\sum_{x,y} p(x,y)\,I^*(x:y) \stackrel{+}{=} \sum_{x,y} p(x,y)\,[K(x) + K(y) - K(x,y)].$$
Define $\sum_y p(x,y) = p_1(x)$ and $\sum_x p(x,y) = p_2(y)$ to obtain
$$\sum_{x,y} p(x,y)\,I^*(x:y) \stackrel{+}{=} \sum_x p_1(x)K(x) + \sum_y p_2(y)K(y) - \sum_{x,y} p(x,y)K(x,y).$$
The distributions $p_i$ ($i = 1,2$) are computable. We have seen in (3.1.2) that
$$H(q) \le \sum_x q(x)K(x) \stackrel{+}{<} H(q) + K(q).$$
Hence $H(p_i) \le \sum_x p_i(x)K(x) \stackrel{+}{<} H(p_i) + K(p_i)$ ($i = 1,2$), and $H(p) \le \sum_{x,y} p(x,y)K(x,y) \stackrel{+}{<} H(p) + K(p)$. On the other hand, the probabilistic mutual information is expressed in the entropies by $I(X:Y) = H(p_1) + H(p_2) - H(p)$. By construction of the $p_i$'s above, we have $K(p_1), K(p_2) \stackrel{+}{<} K(p)$. Since the complexities are positive, substitution establishes the lemma.

Remark 3.1.11 The information $I^*(x:y)$ can be written as
$$I^*(x:y) \stackrel{+}{=} \log\frac{\mathbf{m}(x,y)}{\mathbf{m}(x)\,\mathbf{m}(y)}.$$
Formally, $I^*(x:y)$ looks like $\mathbf{d}((x,y)\mid \mathbf{m}\times\mathbf{m})$ with the function $\mathbf{d}(\cdot\mid\cdot)$ introduced in (2.2.2). Thus, it looks like $I^*(x:y)$ measures the deficiency of randomness with respect to the distribution $\mathbf{m}\times\mathbf{m}$. The latter distribution expresses our "hypothesis" that 푥, 푦 are "independent" from each other. There is a serious technical problem with this interpretation: the function $\mathbf{d}_P(x)$ was only defined for computable measures 푃. Though the definition can be extended, it is not clear at all that the expression $\mathbf{d}_P(x)$ will also play the role of universal test for arbitrary non-computable distributions. Levin's theory culminating in [33] develops this idea, and we return to it in later parts of these notes. y
Let us examine the size of the defects of naive additivity and information symmetry. Corollary 3.1.12 For the defect of additivity, we have
$$K(x) + K(y\mid x) - K(x,y) \stackrel{+}{=} I(K(x):y\mid x) \stackrel{+}{<} K(K(x)\mid x). \tag{3.1.9}$$
For information asymmetry, we have
$$I(x:y) - I(y:x) \stackrel{+}{=} I(K(y):x\mid y) - I(K(x):y\mid x). \tag{3.1.10}$$
Proof. Immediate from the addition theorem.

Theorem 3.1.6 (Levin) The information $I(x:y)$ is not even asymptotically symmetric.

Proof. For some constant 푐, assume $|x| = n$ and $K(K(x)\mid x) \ge \log n - \log\log n - c$. Consider 푥, $x^*$ and 퐾(푥). Then we can check immediately the relations
$$I(x:K(x)) \stackrel{+}{<} 3\log\log n \stackrel{+}{<} \log n - \log\log n - c < I(x^*:K(x))$$
and $I(K(x):x) \stackrel{+}{=} I(K(x):x^*)$. It follows that symmetry breaks down in an exponentially great measure either on the pair $(x, K(x))$ or the pair $(x^*, K(x))$.
Data processing

The following useful identity is also classical, and is called the data processing identity:
$$I(Z:(X,Y)) = I(Z:X) + I(Z:Y\mid X). \tag{3.1.11}$$
Let us see what corresponds to this for algorithmic information. Here is the correct conditional version of the addition theorem:

Theorem 3.1.7 We have
$$K(x,y\mid z) \stackrel{+}{=} K(x\mid z) + K(y\mid x, K(x\mid z), z). \tag{3.1.12}$$
Proof. The same as for the unconditional version.

Remark 3.1.13 The reader may have guessed the identity
$$K(x,y\mid z) \stackrel{+}{=} K(x\mid z) + K(y\mid x, K(x), z),$$
but it is incorrect: taking $z = x$, $y = K(x)$, the left-hand side equals $K(x^*\mid x)$, and the right-hand side equals $K(x\mid x) + K(K(x)\mid x^*, x) \stackrel{+}{=} 0$. y
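The data processing identity (3.1.11) can also be verified mechanically; the sketch below (an added illustration on synthetic data) checks it for a random joint distribution of $(Z, X, Y)$:

```python
# Verify I(Z:(X,Y)) = I(Z:X) + I(Z:Y|X) on a random joint distribution.

from itertools import product
from math import log2
import random

random.seed(0)
raw = {t: random.random() for t in product([0, 1], repeat=3)}  # (z, x, y)
total = sum(raw.values())
P = {t: p / total for t, p in raw.items()}

def H(coords):
    """Entropy of the marginal on the given coordinate indices."""
    m = {}
    for t, p in P.items():
        key = tuple(t[i] for i in coords)
        m[key] = m.get(key, 0) + p
    return -sum(p * log2(p) for p in m.values() if p > 0)

Z, X, Y = 0, 1, 2
I_Z_XY = H([Z]) + H([X, Y]) - H([Z, X, Y])
I_Z_X = H([Z]) + H([X]) - H([Z, X])
I_Z_Y_given_X = H([Z, X]) + H([Y, X]) - H([Z, X, Y]) - H([X])
assert abs(I_Z_XY - (I_Z_X + I_Z_Y_given_X)) < 1e-9
print(round(I_Z_XY, 4), round(I_Z_X, 4), round(I_Z_Y_given_X, 4))
```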
The following inequality is a useful tool. It shows that conditional complexity is a kind of one-sided distance, and it is the algorithmic analogue of the well-known (and not difficult) classical inequality
H(푋 | 푌) ≤ H(푍 | 푌) + H(푋 | 푍).
Theorem 3.1.8 (Simple triangle inequality) We have
$$K(z\mid x) \stackrel{+}{<} K(y,z\mid x) \stackrel{+}{<} K(y\mid x) + K(z\mid y). \tag{3.1.13}$$
Proof. According to (3.1.12), we have
퐾( 푦, 푧 | 푥) =+ 퐾( 푦 | 푥) + 퐾(푧 | 푦, 퐾( 푦 | 푥), 푥).
The left-hand side is $\stackrel{+}{>} K(z\mid x)$, the second term of the right-hand side is $\stackrel{+}{<} K(z\mid y)$.

Equation (3.1.12) justifies the following.

Definition 3.1.14
퐼∗(푥 : 푦 | 푧) = 퐾(푥 | 푧) + 퐾( 푦 | 푧) − 퐾(푥, 푦 | 푧). (3.1.14)
y Then we have
퐼∗(푥 : 푦 | 푧) =+ 퐾( 푦 | 푧) − 퐾( 푦 | 푥, 퐾(푥 | 푧), 푧) =+ 퐾(푥 | 푧) − 퐾(푥 | 푦, 퐾( 푦 | 푧), 푧).
Theorem 3.1.9 (Algorithmic data processing identity) We have
퐼∗(푧 : (푥, 푦)) =+ 퐼∗(푧 : 푥) + 퐼∗(푧 : 푦 | 푥∗). (3.1.15)
Proof. We have
$$\begin{aligned}
I^*(z:(x,y)) &\stackrel{+}{=} K(x,y) + K(z) - K(x,y,z),\\
I^*(z:x) &\stackrel{+}{=} K(x) + K(z) - K(x,z),\\
I^*(z:(x,y)) - I^*(z:x) &\stackrel{+}{=} K(x,y) + K(x,z) - K(x,y,z) - K(x)\\
&\stackrel{+}{=} K(y\mid x^*) + K(z\mid x^*) - K(y,z\mid x^*)\\
&\stackrel{+}{=} I^*(z:y\mid x^*),
\end{aligned}$$
where we used additivity and the definition (3.1.14) repeatedly.
Corollary 3.1.15 The information 퐼∗ is monotonic:
$$I^*(x:(y,z)) \stackrel{+}{>} I^*(x:y). \tag{3.1.16}$$
The following theorem is also sometimes useful. Theorem 3.1.10 (Starry triangle inequality) For all 푥, 푦, 푧, we have
$$K(x\mid y^*) \stackrel{+}{<} K(x,z\mid y^*) \stackrel{+}{<} K(z\mid y^*) + K(x\mid z^*). \tag{3.1.17}$$
Proof. Using the algorithmic data-processing identity (3.1.15) and the definitions, we have
$$\begin{aligned}
K(z) - K(z\mid y^*) + K(x\mid z^*) - K(x\mid y, K(y\mid z^*), z^*)
&\stackrel{+}{=} I^*(y:z) + I^*(y:x\mid z^*)\\
&\stackrel{+}{=} I^*(y:(x,z))\\
&\stackrel{+}{=} K(x,z) - K(x,z\mid y^*).
\end{aligned}$$
By additivity, $K(z) + K(x\mid z^*) \stackrel{+}{=} K(x,z)$; cancelling these terms and changing sign gives
$$K(x,z\mid y^*) \stackrel{+}{=} K(z\mid y^*) + K(x\mid y, K(y\mid z^*), z^*).$$
From here, we proceed as in the proof of the simple triangle inequality (3.1.13).
3.1.2 Information non-increase

In this subsection, we somewhat follow the exposition of [23]. The classical data processing identity has a simple but important application. Suppose that the random variables 푍, 푋, 푌 form a Markov chain in this order. Then $I(Z:Y\mid X) = 0$ and hence $I(Z:(X,Y)) = I(Z:X)$. In words: all information in 푍 about 푌 is coming through 푋: a Markov transition from 푋 to 푌 cannot increase the information about 푍. Let us try to find the algorithmic counterpart of this theorem.

Here, we rigorously show that this is the case in the algorithmic statistics setting: the information in one object about another cannot be increased by any deterministic algorithmic method by more than a constant. With added randomization this holds with overwhelming probability. For more elaborate versions see [31, 33].
Suppose we want to obtain information about a certain object 푥. It does not seem to be a good policy to guess blindly. This is confirmed by the following inequality:
$$\sum_y \mathbf{m}(y)\,2^{I(x:y)} \stackrel{*}{<} 1, \tag{3.1.18}$$
which says that for each 푥, the expected value of $2^{I(x:y)}$ is small, even with respect to the universal constructive semimeasure $\mathbf{m}(y)$. The proof is immediate if we recognize that by the Coding Theorem,
$$2^{I(x:y)} \stackrel{*}{=} \frac{2^{-K(y\mid x)}}{\mathbf{m}(y)},$$
and use the estimate (1.6.10).

We prove a strong version of the information non-increase law under deterministic processing (later we need the attached corollary):

Theorem 3.1.11 Given 푥 and 푧, let 푞 be a program computing 푧 from $x^*$. Then we have
$$I^*(y:z) \stackrel{+}{<} I^*(y:x) + K(q).$$
Proof. By monotonicity (3.1.16) and the data processing identity (3.1.15):
$$I^*(y:z) \stackrel{+}{<} I^*(y:(x,z)) \stackrel{+}{=} I^*(y:x) + I^*(y:z\mid x^*).$$
By definition of information and the definition of 푞:
$$I^*(y:z\mid x^*) \stackrel{+}{<} K(z\mid x^*) \stackrel{+}{<} K(q).$$
Randomized computation can increase information only with negligible probability. Suppose that 푧 is obtained from 푥 by some randomized computation. The probability $p(z\mid x)$ of obtaining 푧 from 푥 is a semicomputable distribution over the 푧's. The information increase $I^*(z:y) - I^*(x:y)$ satisfies the theorem below.

Theorem 3.1.12 For all 푥, 푦, 푧 we have
$$\mathbf{m}(z\mid x^*)\,2^{I^*(z:y)-I^*(x:y)} \stackrel{*}{<} \mathbf{m}(z\mid x^*, y, K(y\mid x^*)).$$
Therefore it is upper-bounded by $\mathbf{m}(z\mid x) \stackrel{*}{<} \mathbf{m}(z\mid x^*)$.
Remark 3.1.16 The theorem implies
$$\sum_z \mathbf{m}(z\mid x^*)\,2^{I^*(z:y)-I^*(x:y)} \stackrel{*}{<} 1. \tag{3.1.19}$$
This says that the $\mathbf{m}(\cdot\mid x^*)$-expectation of the exponential of the increase is bounded by a constant. It follows that, for example, the probability of an increase of mutual information by the amount 푑 is $\stackrel{*}{<} 2^{-d}$. y

Proof. We have, by the data processing inequality:
퐼∗( 푦 : (푥, 푧)) − 퐼∗( 푦 : 푥) =+ 퐼∗( 푦 : 푧 | 푥∗) =+ 퐾(푧 | 푥∗) − 퐾(푧 | 푦, 퐾( 푦 | 푥∗), 푥∗).
Hence, using also $I^*(y:(x,z)) \stackrel{+}{>} I^*(y:z)$ by monotonicity:
$$I^*(y:z) - I^*(y:x) - K(z\mid x^*) \stackrel{+}{<} -K(z\mid y, K(y\mid x^*), x^*).$$
Putting both sides into the exponent gives the statement of the theorem.

Remark 3.1.17 The theorem on information non-increase and its proof look similar to the theorems on randomness conservation. There is a formal connection, via the observation made in Remark 3.1.11. Due to the difficulties mentioned there, we will explore the connection only later in our notes. y
3.2 The complexity of decidable and enumerable sets
If 휔 = 휔(1)휔(2)··· is an infinite binary sequence, then we may be interested in how the complexity of the initial segments
$$\omega(1:n) = \omega(1)\cdots\omega(n)$$
grows with 푛. We would guess that if 휔 is "random" then $C(\omega(1:n))$ is approximately 푛; this question will be studied satisfactorily in later sections. Here, we want to look at some other questions.

Can we determine whether the sequence 휔 is computable, by just looking at the complexity of its initial segments? It is easy to see that if 휔 is computable then $C(\omega(1:n)) \stackrel{+}{<} \log n$. But of course, if $C(\omega(1:n)) \stackrel{+}{<} \log n$ then it is not necessarily true yet that 휔 is computable. Maybe, we should put 푛 into the condition. If 휔 is computable then $C(\omega(1:n)\mid n) \stackrel{+}{<} 0$. Is it true that if $C(\omega(1:n)\mid n) \stackrel{+}{<} 0$ for all 푛 then indeed 휔 is computable? Yes, but the proof is quite difficult (see either its original, attributed to Albert Meyer in [37], or
in [61]). The suspicion arises that when measuring the complexity of starting segments of infinite sequences, neither Kolmogorov complexity nor its prefix version is the most natural choice. Other examples will support the suspicion, so let us introduce, following Loveland's paper [37], the following variant.

Definition 3.2.1 Let us say that a program 푝 decides a string $x = (x(1), x(2), \ldots)$ on a Turing machine 푇 up to 푛 if for all $i \in \{1,\ldots,n\}$ we have $T(p,i) = x(i)$. Let us further define the decision complexity
퐶푇 (푥; 푛) = min{푙( 푝) : 푝 decides 푥 on 푇 up to 푛}.
y

As for the Kolmogorov complexity, an invariance theorem holds, and there is an optimal machine $T_0$ for this complexity. Again, we omit the index 푇, assuming that such an optimal machine has been fixed.

The differences between $C(x)$, $C(x\mid n)$ and $C(x;n)$ are interesting to explore; intuitively, the important difference is that the program achieving the decision complexity of 푥 does not have to offer precise information about the length of 푥. We clearly have
$$C(x\mid n) \stackrel{+}{<} C(x;n) \stackrel{+}{<} C(x).$$
Examples showing that each of these inequalities can be strict are left as exercises.

Remark 3.2.2 If $\omega(1:n)$ is a segment of an infinite sequence 휔 then we can write $C(\omega;n)$ instead of $C(\omega(1:n);n)$ without confusion, since there is only one way to understand this expression. y

Decision complexity offers an easy characterization of decidable infinite sequences.

Theorem 3.2.1 Let $\omega = (\omega(1), \omega(2), \ldots)$ be an infinite sequence. Then 휔 is decidable if and only if
$$C(\omega;n) \stackrel{+}{<} 0. \tag{3.2.1}$$
Proof. If 휔 is decidable then (3.2.1) is obviously true. Suppose now that (3.2.1) holds: there is a constant 푐 such that for all 푛 we have $C(\omega(1:n);n) < c$.
Then there is a program 푝 of length $\le c$ with the property that, for infinitely many 푛, it decides 휔 on the optimal machine $T_0$ up to 푛. Then for all 푖, we have $T_0(p,i) = \omega(i)$, showing that 휔 is decidable.

Let us use now the new tool to prove a somewhat more surprising result. Let 휔 be a 0-1 sequence that is the indicator function of a recursively enumerable set. In other words, the set $\{i : \omega(i) = 1\}$ is recursively enumerable. As we know, such sequences are not always decidable, so $C(\omega(1:n);n)$ is not going to be bounded. How fast can it grow? The following theorem gives an exact answer, showing that it grows only logarithmically.

Theorem 3.2.2 (Barzdin, see [2])
a) Let 휔 be the indicator function of a recursively enumerable set 퐸. Then we have
$$C(\omega;n) \stackrel{+}{<} \log n.$$
b) The set 퐸 can be chosen such that for all 푛 we have $C(\omega;n) \ge \log n$.

Proof. Let us prove (a) first. Let $n'$ be the first power of 2 larger than 푛. Let $k(n) = \sum_{i\le n'}\omega(i)$. For a constant 푑, let $p = p(n,d)$ be a program that contains an initial segment 푞 of size 푑 followed by the number $k(n)$ in binary notation, padded to $\lceil\log n\rceil + d$ with 0's. The program 푞 is self-delimiting, so it sees where $k(n)$ begins. The machine $T_0(p,i)$ works as follows, under the control of 푞. From the length of the whole 푝, it determines $n'$. It begins to enumerate $\omega(1:n')$ until it finds $k(n)$ 1's. Then it knows the whole $\omega(1:n')$, so it outputs $\omega(i)$.

Let us prove now (b). Let us list all possible programs 푞 for the machine $T_0$ as $q = 1, 2, \ldots$. Let $\omega(q) = 1$ if and only if $T_0(q,q) = 0$. The sequence 휔 is obviously the indicator function of a recursively enumerable set. To show that 휔 has the desired property, assume that for some 푛 there is a 푝 with $T_0(p,i) = \omega(i)$ for all $i \le n$. Then $p > n$, since $\omega(p)$ is defined to be different from $T_0(p,p)$. It follows that $l(p) \ge \log n$.

Decision complexity has been especially convenient for the above theorem. Neither $C(\omega(1:n))$ nor $C(\omega(1:n)\mid n)$ would have been suitable to formulate such a sharp result. To analyze the phenomenon further, we introduce some more concepts.

Definition 3.2.3 Let us denote, for this section, by 퐸 the set of those binary strings 푝 on which the optimal prefix machine 푇 halts:
$$E^t = \{p : T(p) \text{ halts in} < t \text{ steps}\}, \tag{3.2.2}$$
$$E = E^\infty.$$
Let
$$\chi = \chi_E \tag{3.2.3}$$
be the infinite sequence that is the indicator function of the set 퐸, when the latter is viewed as a set of numbers. y

It is easy to see that the set 퐸 is complete among recursively enumerable sets with respect to many-one reduction. The above theorems show that though it contains an infinite amount of information, this information is not stored in the sequence 휒 densely at all: there are at most $\log n$ bits of it in the segment $\chi(1:n)$. There is an infinite sequence, though, in which the same information is stored much more densely: as we will see later, maximally densely.

Definition 3.2.4 (Chaitin's Omega) Let
$$\Omega^t = \sum_{p\in E^t} 2^{-l(p)}, \tag{3.2.4}$$
$$\Omega = \Omega^\infty.$$
y

Let $\Omega(1:n)$ be the sequence of the first 푛 binary digits in the expansion of Ω, and let it also denote the binary number $0.\Omega(1)\cdots\Omega(n)$. Then we have
$$\Omega(1:n) < \Omega < \Omega(1:n) + 2^{-n}.$$
Theorem 3.2.3 The sequences Ω and 휒 are Turing-equivalent.

Proof. Let us show first that given Ω as an oracle, a Turing machine can compute 휒. Suppose that we want to know for a given string 푝 of length 푛 whether $\chi(p) = 1$, that is, whether $T(p)$ halts. Let $t(n)$ be the first 푡 for which $\Omega^t > \Omega(1:n)$. If a program 푝 of length 푛 is not in $E^{t(n)}$ then it is not in 퐸 at all, since $2^{-l(p)} = 2^{-n} > \Omega - \Omega^{t(n)}$. It follows that $\chi(1:2^n)$ can be computed from $\Omega(1:n)$.

To show that Ω can be computed from 퐸, let us define the recursively enumerable set
$$E' = \{r : r \text{ rational}, \Omega > r\}.$$
The set $E'$ is reducible to 퐸 since the latter is complete among recursively enumerable sets with respect to many-one reduction. On the other hand, Ω is obviously computable from $E'$.

The following theorem shows that Ω stores the information more densely.

Theorem 3.2.4 We have $K(\Omega(1:n)) \stackrel{+}{>} n$.
Proof. Let $p_1$ be a self-delimiting program outputting $\Omega(1:n)$. Recall the definition of the sets $E^t$ in (3.2.2) and the numbers $\Omega^t$ in (3.2.4). Let $\Omega_1$ denote the binary number $0.\Omega(1)\cdots\Omega(n)$, and let $t_1$ be the first 푡 with $\Omega^t > \Omega_1$. Let $x_1$ be the first string 푥 such that $T(p) \ne x$ for any $p \in E^{t_1}$. We have $K(x_1) \ge n$. On the other hand, we just computed $x_1$ from $p_1$, so $K(x_1\mid p_1) \stackrel{+}{<} 0$. We found $n \le K(x_1) \stackrel{+}{<} K(p_1) + K(x_1\mid p_1) \stackrel{+}{<} K(p_1)$.
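Before moving on, here is a toy sketch (an added illustration; the finite halting table is made up and merely stands in for a real prefix machine) of how the proof of Theorem 3.2.3 decides the halting problem from a prefix of Ω:

```python
# HALT_TIME maps each program of a fake prefix-free machine to its
# halting time (None = runs forever). Omega = sum of 2^-len(p), p halting.

HALT_TIME = {"0": None, "10": 7, "110": 2}   # hypothetical machine

def omega_t(t):
    """Omega^t: total weight of programs halting in fewer than t steps."""
    return sum(2.0 ** -len(p) for p, ht in HALT_TIME.items()
               if ht is not None and ht < t)

OMEGA = omega_t(10 ** 9)   # for a finite table this is the true Omega

def halts(p, n):
    """Decide halting of p, len(p) <= n, from the first n bits of Omega."""
    prefix = int(OMEGA * 2 ** n) / 2 ** n    # Omega(1:n), read as a number
    t = 1
    while omega_t(t) <= prefix:              # t(n): first t, Omega^t > prefix
        t += 1
    # A halting p of length <= n must already be in E^{t(n)}: otherwise its
    # weight 2^-len(p) >= 2^-n would push Omega beyond Omega(1:n) + 2^-n.
    ht = HALT_TIME.get(p)
    return ht is not None and ht < t

print(halts("0", 2), halts("10", 2))   # False True
```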
3.3 The complexity of complexity
3.3.1 Complexity is sometimes complex

This section is devoted to a quantitative estimation of the uncomputability of the complexity function $C(x)$. Actually, we will work with the prefix complexity $K(x)$, but the results would be very similar for $C(x)$. The first result shows that the value $K(x)$ is not only not computable from 푥, but its conditional complexity $K(K(x)\mid x)$ given 푥 is sometimes quite large. How large can it be expected to be? Certainly not much larger than $\log K(x) + 2\log\log K(x)$, since we can describe any value $K(x)$ using this many bits. But it can come close to this, as shown by Theorem 3.1.5. This theorem says that for all 푛, there exists 푥 of length 푛 with
$$K(K(x)\mid x) \stackrel{+}{>} \log n - \log\log n. \tag{3.3.1}$$
Proof of Theorem 3.1.5. Let $U(p,x)$ be the optimal self-delimiting machine for which $K(y\mid x) = \min_{U(p,x)=y} l(p)$. Let $s \le \log n$ be such that if $l(x) = n$ then a 푝 of length 푠 can be found for which $U(p,x) = K(x)$. We will show
$$s \stackrel{+}{>} \log n - \log\log n. \tag{3.3.2}$$
Let us say that $p \in \{0,1\}^s$ is suitable for $x \in \{0,1\}^n$ if there exists a $k = U(p,x)$ and a $q \in \{0,1\}^k$ with $U(q,\Lambda) = x$. Thus, 푝 is suitable for 푥 if it produces the length of some program of 푥, not necessarily a shortest program. Let $M_i$ denote the set of those 푥 of length 푛 for which there exist at least 푖 different suitable 푝 of length 푠. We will examine the sequence
$$\{0,1\}^n = M_0 \supseteq M_1 \supseteq \cdots \supseteq M_j \supseteq M_{j+1} = \emptyset,$$
where $M_j \ne \emptyset$. It is clear that $2^s \ge j$. To lower-bound 푗, we will show that the sets $M_i$ decrease only slowly:
$$\log|M_i| \stackrel{+}{<} \log|M_{i+1}| + 4\log n. \tag{3.3.3}$$
We can assume
$$\log|M_i \setminus M_{i+1}| \ge \log|M_i| - 1, \tag{3.3.4}$$
otherwise (3.3.3) is trivial. We will write a program that finds an element 푥 of $M_i \setminus M_{i+1}$ with the property $K(x) \ge \log|M_i| - 1$. The program works as follows.
• It finds 푖, 푛 with the help of descriptions of length $\log n + 2\log\log n$, and 푠 with the help of a description of length $2\log\log n$.
• It finds |푀푖+1| with the help of a description of length log |푀푖+1| + log 푛 + 2 log log 푛.
• From these data, it determines the set $M_{i+1}$, and then begins to enumerate the set $M_i \setminus M_{i+1}$ as $x_1, x_2, \ldots$. For each of these elements $x_r$, it knows that there are exactly 푖 programs suitable for $x_r$; it finds all of them, and finds the shortest program length for $x_r$ that they produce. Therefore it can compute $K(x_r)$.
• According to the assumption (3.3.4), there is an $x_r$ with $K(x_r) \ge \log|M_i| - 1$. The program outputs the first such $x_r$.

The construction of the program shows that its length is $\stackrel{+}{<} \log|M_{i+1}| + 4\log n$, hence for the 푥 we found
$$\log|M_i| - 1 \le K(x) \stackrel{+}{<} \log|M_{i+1}| + 4\log n,$$
which proves (3.3.3). This implies $j \ge n/(4\log n)$, and hence (3.3.2).
3.3.2 Complexity is rarely complex

Let
$$f(x) = K(K(x)\mid x).$$
We have seen that $f(x)$ is sometimes large. Here, we will show that the strings 푥 for which this happens are rare. Recall that we defined 휒 in (3.2.3) as the indicator sequence of the halting problem.

Theorem 3.3.1 We have
$$I(\chi:x) \stackrel{+}{>} f(x) - 2.4\log f(x).$$
In view of the inequality (3.1.18), this shows that such sequences are rare, even in terms of the universal probability, so they are certainly rare if we measure them by any computable distribution. So, we may even call such sequences “exotic”.
In the proof, we start by showing that the strings in question are rare. Then the theorem will follow when we make use of the fact that $f(x)$ is computable from 휒. We need a lemma about the approximation of one measure by another from below.

Lemma 3.3.1 For any semimeasure 휈 and measure $\mu \le \nu$, let $S_m = \{x : 2^m\mu(x) \le \nu(x)\}$. Then $\mu(S_m) \le 2^{-m}$. If $\sum_x(\nu(x) - \mu(x)) < 2^{-n}$ then $\nu(S_1) < 2^{-n+1}$.

The proof is straightforward. Let
$$X(n) = \{x : f(x) = n\}.$$
Lemma 3.3.2 We have
$$\mathbf{m}(X(n)) \stackrel{*}{<} n^{1.2}\,2^{-n}.$$
Proof. Using the definition of $E^t$ in (3.2.2), let
$$\mathbf{m}^t(x) = \sum\{2^{-l(p)} : T(p) = x \text{ in} < t \text{ steps}\}, \qquad K^t(x) = -\log\mathbf{m}^t(x).$$
Then according to the definition (1.6.14) we have $\mathbf{m}(x) = \mathbf{m}^\infty(x)$, and $K(x) \stackrel{+}{=} K^\infty(x)$. Let
$$t(k) = \min\{t : \Omega(1:k) < \Omega^t\}, \qquad \mu = \mathbf{m}^{t(k)}.$$
Clearly, $K^{t(k)}(x) \stackrel{+}{>} K(x)$. Let us show that $K^{t(k)}(x)$ is a good approximation for $K(x)$ for most 푥. Let
$$Y(k) = \{x : K(x) \le K^{t(k)}(x) - 1\}.$$
By definition, for 푥 ∉ 푌 (푘) we have
$$|K^{t(k)}(x) - K(x)| \stackrel{+}{=} 0.$$
On the other hand, applying Lemma 3.3.1 with $\mu = \mathbf{m}^{t(k)}$, $\nu = \mathbf{m}$, we obtain
$$\mathbf{m}(Y(k)) < 2^{-k+1}. \tag{3.3.5}$$
Note that
$$K(K^{t(k)}(x)\mid x, \Omega(1:k)) \stackrel{+}{=} 0,$$
therefore for $x \notin Y(k)$ we have
$$K(K(x)\mid x, \Omega(1:k)) \stackrel{+}{=} 0,$$
$$K(K(x)\mid x) \stackrel{+}{<} k + 1.2\log k.$$
If $n = k + 1.2\log k$ then $k \stackrel{+}{=} n - 1.2\log n$, and hence, if $x \notin Y(n - 1.2\log n)$ then $K(K(x)\mid x) \stackrel{+}{<} n$. Thus, there is a constant 푐 such that
푋 (푛) ⊆ 푌 (푛 − 1.2 log 푛 − 푐).
Using (3.3.5) this gives the statement of the lemma.

Proof of Theorem 3.3.1. Since $f(x)$ is computable from Ω, the function
$$\nu(x) = \mathbf{m}(x)\,2^{f(x)}(f(x))^{-2.4}$$
is computable from Ω. Let us show that it is a semimeasure (within a multiplicative constant). Indeed, using the above lemma:
$$\sum_x \nu(x) = \sum_k\sum_{x\in X(k)}\nu(x) = \sum_k 2^k k^{-2.4}\,\mathbf{m}(X(k)) \stackrel{*}{<} \sum_k k^{-1.2} \stackrel{*}{<} 1.$$
Since $\mathbf{m}(\cdot\mid\Omega)$ is the universal semimeasure relative to Ω, we find $\mathbf{m}(x\mid\Omega) \stackrel{*}{>} \nu(x)$, hence
$$K(x\mid\Omega) \stackrel{+}{<} -\log\nu(x) = K(x) - f(x) + 2.4\log f(x),$$
$$I(\Omega:x) \stackrel{+}{>} f(x) - 2.4\log f(x).$$
Since Ω is equivalent to 휒, the proof is complete.
4 Generalizations
4.1 Continuous spaces, noncomputable measures
This section starts the consideration of randomness in continuous spaces and randomness with respect to noncomputable measures.
4.1.1 Introduction

The algorithmic theory of randomness is well developed when the underlying space is the set of finite or infinite sequences and the underlying probability distribution is the uniform distribution or a computable distribution. These restrictions seem artificial. Some progress has been made to extend the theory to arbitrary Bernoulli distributions by Martin-Löf in [38], and to arbitrary distributions, by Levin in [30, 32, 33]. The paper [25] by Hertling and Weihrauch also works in general spaces, but it is restricted to computable measures. Similarly, Asarin's thesis [1] defines randomness for sample paths of the Brownian motion: a fixed random process with computable distribution.

The exposition here has been inspired mainly by Levin's early paper [32] (and the much more elaborate [33] that uses different definitions): let us summarize part of the content of [32]. The notion of a constructive topological space X and the space of measures over X is introduced. Then the paper defines the notion of a uniform test. Each test is a lower semicomputable function $(\mu, x) \mapsto f_\mu(x)$, satisfying $\int f_\mu(x)\,\mu(dx) \le 1$ for each measure 휇. There are also some additional conditions. The main claims are the following.
a) There is a universal test $\mathbf{t}_\mu(x)$, a test such that for each other test 푓 there is a constant $c > 0$ with $f_\mu(x) \le c\cdot\mathbf{t}_\mu(x)$. The deficiency of randomness is defined as $\mathbf{d}_\mu(x) = \log\mathbf{t}_\mu(x)$.
b) The universal test has some strong properties of "randomness conservation": these say, essentially, that a computable mapping or a computable randomized transition does not decrease randomness.
c) There is a measure 푀 with the property that for every outcome 푥 we have $\mathbf{t}_M(x) \le 1$. In the present work, we will call such measures neutral.
d) Semimeasures (semi-additive measures) are introduced and it is shown that there is a lower semicomputable semimeasure that is neutral (let us assume that the 푀 introduced above is lower semicomputable).
e) Mutual information $I(x:y)$ is defined with the help of (an appropriate version of) Kolmogorov complexity, between outcomes 푥 and 푦. It is shown that $I(x:y)$ is essentially equal to $\mathbf{d}_{M\times M}(x,y)$. This interprets mutual information as a kind of "deficiency of independence".

This impressive theory leaves a number of issues unresolved:
1. The space of outcomes is restricted to be a compact topological space, moreover, a particular compact space: the set of sequences over a finite alphabet (or, implicitly in [33], a compactified infinite alphabet). However, a good deal of modern probability theory happens over spaces that are not even locally compact: for example, in case of the Brownian motion, over the space of continuous functions.
2. The definition of a uniform randomness test includes some conditions (different ones in [32] and in [33]) that seem somewhat arbitrary.
3. No simple expression is known for the general universal test in terms of description complexity. Such expressions are nice to have if they are available.

Here, we intend to carry out as much of Levin's program as seems possible after removing the restrictions. A number of questions remain open, but we feel that they are worth at least formulating. A fairly large part of the exposition is devoted to the necessary conceptual machinery. This will also allow us to carry further some other initiatives started in the works [38] and [30]: the study of tests that test nonrandomness with respect to a whole class of measures (like the Bernoulli measures).

Constructive analysis has been developed by several authors, converging approximately on the same concepts. We will make use of a simplified version of the theory introduced in [57]. As we have not found a constructive measure theory in the literature fitting our purposes, we will develop this theory here, over (constructive) complete separable metric spaces. This generality is well supported by standard results in measure theoretical probability, and is sufficient for constructivizing a large part of current probability theory.
The appendix recalls some of the needed topology, measure theory, constructive analysis and constructive measure theory. We also make use of the notation introduced there.
4.1.2 Uniform tests

We first define tests of randomness with respect to an arbitrary measure. Recall the definition of lower semicomputable real functions on a computable metric space X.

Definition 4.1.1 Let us be given a computable complete metric space $\mathbf{X} = (X, d, D, \alpha)$. For an arbitrary measure $\mu \in \mathcal{M}(\mathbf{X})$, a 휇-test of randomness is a 휇-lower semicomputable function $f : X \to \mathbb{R}_+$ with the property $\mu f \le 1$. We call an element 푥 random with respect to 휇 if $f(x) < \infty$ for all 휇-tests 푓. But even among random elements, the size of the tests quantifies (non-)randomness.

A uniform test of randomness is a lower semicomputable function $f : X\times\mathcal{M}(\mathbf{X}) \to \mathbb{R}_+$, written as $(x,\mu) \mapsto f_\mu(x)$, such that $f_\mu(\cdot)$ is a 휇-test for each 휇. y

The condition $\mu f \le 1$ guarantees that the probability of those outcomes whose randomness is $\ge m$ is at most $1/m$. The definition of tests is in the spirit of Martin-Löf's tests. The important difference is in the semicomputability condition: instead of restricting the measure 휇 to be computable, we require the test to be lower semicomputable also in its argument 휇.

The following result implies that every 휇-test can be extended to a uniform test.
Theorem 4.1.1 Let $\phi_e$, $e = 1, 2, \ldots$ be an enumeration of all lower semicomputable functions $\mathbf{X}\times\mathbf{Y}\times\mathcal{M}(\mathbf{X}) \to \mathbb{R}_+$, where Y is also a computable metric space, and let $s : \mathbf{Y}\times\mathcal{M}(\mathbf{X}) \to \mathbb{R}$ be a lower semicomputable function. There is a recursive function $e \mapsto e'$ with the property that
a) For each 푒, the function $\phi_{e'}(x,y,\mu)$ is everywhere defined with $\mu^x\phi_{e'}(x,y,\mu) \le s(y,\mu)$.
b) For each $e, y, \mu$, if $\mu^x\phi_e(x,y,\mu) < s(y,\mu)$ then $\phi_{e'}(\cdot,y,\mu) = \phi_e(\cdot,y,\mu)$.

This theorem generalizes a theorem of Hoyrup and Rojas (in allowing a lower semicomputable upper bound $s(y,\mu)$).
Proof. By Proposition B.1.38, we can represent $\phi_e(x,y,\mu)$ as a supremum $\phi_e = \sup_i h_{e,i}$ where $h_{e,i}(x,y,\mu)$ is a computable function of $e, i$, monotonically increasing in 푖. Similarly, we can represent $s(y,\mu)$ as a supremum $\sup_j s_j(y,\mu)$,
where $s_j(y,\mu)$ is a computable function monotonically increasing in 푗. The integral $\mu^x h_{e,i}(x,y,\mu)$ is computable as a function of $(y,\mu)$; in particular, it is upper semicomputable.

Define $h'_{e,i,j}(x,y,\mu)$ as $h_{e,i}(x,y,\mu)$ for all $j, y, \mu$ with $\mu^x h_{e,i}(x,y,\mu) < s_j(y,\mu)$, and 0 otherwise. Since $s_j(y,\mu)$ is computable, this definition makes the function $h'_{e,i,j}(x,y,\mu)$ lower semicomputable. The function $h''_{e,i}(x,y,\mu) = \sup_j h'_{e,i,j}(x,y,\mu)$ is then also lower semicomputable, with $h''_{e,i}(x,y,\mu) \le h_{e,i}(x,y,\mu)$, and $\mu^x h''_{e,i}(x,y,\mu) \le s(y,\mu)$. Also, $h''_{e,i}(x,y,\mu)$ is monotonically increasing in 푖. The function $\phi'_e(x,y,\mu) = \sup_i h''_{e,i}(x,y,\mu)$ is then also lower semicomputable, with $\mu^x\phi'_e(x,y,\mu) \le s(y,\mu)$ (using that the $h''_{e,i}$ increase in 푖). Define $\phi_{e'}(x,y,\mu) = \phi'_e(x,y,\mu)$.

Consider any $e, y, \mu$ such that $\mu^x\phi_e(x,y,\mu) < s(y,\mu)$ holds. Then for every 푖 there is a 푗 with $\mu^x h_{e,i}(x,y,\mu) < s_j(y,\mu)$, and hence $h'_{e,i,j}(x,y,\mu) = h_{e,i}(x,y,\mu)$. It follows that $h''_{e,i}(x,y,\mu) = h_{e,i}(x,y,\mu)$ for all 푖, and hence $\phi'_e(x,y,\mu) = \phi_e(x,y,\mu)$.
Corollary 4.1.2 (Uniform extension) There is an operation $H_e(x,\mu) \mapsto H_{e'}(x,\mu)$ with the property that $H_{e'}(x,\mu)$ is a uniform test, and if $2H_e(\cdot,\mu)$ is a 휇-test then $H_{e'}(x,\mu) = H_e(x,\mu)$.

Proof. In Theorem 4.1.1 set $\phi_e(x,y,\mu) = \frac{1}{2}H_e(x,\mu)$ with $s(y,\mu) = 1$.
Corollary 4.1.3 (Universal generalized test) Let $s : \mathbf{Y}\times\mathcal{M}(\mathbf{X}) \to \mathbb{R}_+$ be a lower semicomputable function. Let 퐸 be the set of lower semicomputable functions $\phi(x,y,\mu) \ge 0$ with $\mu^x\phi(x,y,\mu) \le s(y,\mu)$. There is a function $\psi \in E$ that is optimal in the sense that for all $\phi \in E$ there is a constant $c_\phi$ with $\phi \le 2^{c_\phi}\psi$.

Proof. Apply the operation $e \mapsto e'$ of Theorem 4.1.1 to the sequence $\phi_e(x,y,\mu)$ ($e = 1, 2, \ldots$) of all lower semicomputable functions of $x, y, \mu$. The elements of the sequence $\phi_{e'}(x,y,\mu)$, $e = 1, 2, \ldots$ are in 퐸, and the sequence $2\phi_{e'}(x,y,\mu)$, $e = 1, 2, \ldots$ contains all elements of 퐸. Hence the function $\psi(x,y,\mu) = \sum_{e=1}^{\infty} 2^{-e}\phi_{e'}(x,y,\mu)$ is in 퐸 and has the optimality property.

Definition 4.1.4 A uniform test 푢 is called universal if for every other test 푡 there is a constant $c_t > 0$ such that for all $x, \mu$ we have $t_\mu(x) \le c_t\,u_\mu(x)$. y

Theorem 4.1.2 (Universal test, [26]) There is a universal uniform test.

Proof. This is a special case of Corollary 4.1.3 with $s(y,\mu) = 1$.
Definition 4.1.5 Let us fix a universal uniform test, called t휇 (푥). An element 푥 ∈ 푋 is called random with respect to measure 휇 ∈ M(X) if t휇 (푥) < ∞. The deficiency of randomness is defined as d휇 (푥) = log t휇 (푥). y If the space is discrete then typically all elements are random with respect to 휇, but they will still be distinguished according to their different values of d휇 (푥).
4.1.3 Sequences

Let our computable metric space $\mathbf{X} = (X, d, D, \alpha)$ be the Cantor space of Example B.1.35.2: the set $\Sigma^{\mathbb N}$ of sequences over a (finite or countable) alphabet Σ. We may want to measure the non-randomness of finite sequences, viewing them as initial segments of infinite sequences. Take the universal test $\mathbf{t}_\mu(x)$. For this, it is helpful to apply the representation of Proposition B.1.23, taking into account that adding the extra parameter 휇 does not change the validity of the theorem:

Proposition 4.1.6 There is a function $g : \mathcal{M}(\mathbf{X})\times\Sigma^* \to \mathbb{R}_+$ with $\mathbf{t}_\mu(\xi) = \sup_n g_\mu(\xi^{\le n})$, and with the following properties:
a) 푔 is lower semicomputable.
b) $v \sqsubseteq w$ implies $g_\mu(v) \le g_\mu(w)$.
c) For all integers $n \ge 0$ we have $\sum_{w\in\Sigma^n}\mu(w)g_\mu(w) \le 1$.

The properties of the function $g_\mu(w)$ clearly imply that $\sup_n g_\mu(\xi^{\le n})$ is a uniform test. The existence of a universal function among the functions 푔 can be proved by the usual methods:
Proposition 4.1.7 Among the functions $g_\mu(w)$ satisfying the properties listed in Proposition 4.1.6, there is one that dominates to within a multiplicative constant.

These facts motivate the following definition.

Definition 4.1.8 (Extended test) Over the Cantor space, we extend the definition of a universal test $\mathbf{t}_\mu(x)$ to finite sequences as follows. We fix a function $\mathbf{t}_\mu(w)$ with $w \in \Sigma^*$ whose existence is assured by Proposition 4.1.7. For infinite sequences 휉 we define $\mathbf{t}_\mu(\xi) = \sup_n \mathbf{t}_\mu(\xi^{\le n})$. The test with values defined also on finite sequences will be called an extended test. y

We could try to define extended tests also over arbitrary constructive metric spaces, extending them to the canonical balls, with the monotonicity property
that $\mathbf{t}_\mu(v) \le \mathbf{t}_\mu(w)$ if ball 푤 is manifestly included in ball 푣. But there is nothing simple and obvious corresponding to the integral requirement (c).

Over the space $\Sigma^{\mathbb N}$ for a finite alphabet Σ, an extended test could also be extracted directly from a test, using the following formula (as observed by Vyugin and Shen).

Definition 4.1.9
푢(푧) = inf{푢(휔) : 휔 is an infinite extension of 푧}.
y This function is lower semicomputable, by Proposition B.1.31.
4.1.4 Conservation of randomness
For 푖 = 1, 0, let X푖 = (푋푖, 푑푖, 퐷푖, 훼푖) be computable metric spaces, and let M푖 = (M(X푖), 휎푖, 휈푖) be the effective topological space of probability measures over X푖. Let Λ be a computable probability kernel from X1 to X0 as defined in Subsection B.2.3. In the following theorem, the same notation d휇 (푥) will refer to the deficiency of randomness with respect to two different spaces, X1 and X0, but this should not cause confusion. Let us first spell out the conservation theorem before interpreting it.
Theorem 4.1.3 For a computable probability kernel Λ from X1 to X0, we have
$$\lambda_x^y\,\mathbf{t}_{\Lambda^*\mu}(y) \stackrel{*}{<} \mathbf{t}_\mu(x). \tag{4.1.1}$$
Proof. Let $\mathbf{t}_\nu(y)$ be the universal test over $\mathbf{X}_0$. The left-hand side of (4.1.1) can be written as
$$u_\mu = \Lambda\mathbf{t}_{\Lambda^*\mu}.$$
According to (A.2.4), we have $\mu u_\mu = (\Lambda^*\mu)\mathbf{t}_{\Lambda^*\mu}$, which is $\le 1$ since t is a test. If we show that $(\mu, x) \mapsto u_\mu(x)$ is lower semicomputable then the universality of $\mathbf{t}_\mu$ will imply that $u_\mu \stackrel{*}{<} \mathbf{t}_\mu$.

According to Proposition B.1.38, as a lower semicomputable function, $\mathbf{t}_\nu(y)$ can be written as $\sup_n g_n(\nu, y)$, where $(g_n(\nu,y))$ is a computable sequence of computable functions. We pointed out in Subsection B.2.3 that the function $\mu \mapsto \Lambda^*\mu$ is computable. Therefore the function $(n,\mu,x) \mapsto \lambda_x^y g_n(\Lambda^*\mu, y)$ is also computable. So, $u_\mu(x)$ is the supremum of a computable sequence of computable functions and as such, lower semicomputable.
It is easier to interpret the theorem first in the special case when $\Lambda = \Lambda_h$ for a computable function $h : X_1 \to X_0$, as in Example B.2.12. Then the theorem simplifies to the following.

Corollary 4.1.10 For a computable function $h : X_1 \to X_0$, we have $\mathbf{d}_{h^*\mu}(h(x)) \stackrel{+}{<} \mathbf{d}_\mu(x)$.
Informally, this says that if 푥 is random with respect to 휇 in X1 then ℎ(푥) is ∗ essentially at least as random with respect to the output distribution ℎ 휇 in X0. Decrease in randomness can only be caused by complexity in the definition of the function ℎ. Let us specialize the theorem even more:
Corollary 4.1.11 For a probability distribution 휇 over the space 푋 × 푌 let 휇푋 be its marginal on the space 푋. Then we have
$$\mathbf{d}_{\mu_X}(x) \stackrel{+}{<} \mathbf{d}_\mu((x,y)).$$
This says, informally, that if a pair is random then each of its elements is random (with respect to the corresponding marginal distribution).

In the general case of the theorem, concerning random transitions, we cannot bound the randomness of each outcome uniformly. The theorem asserts that the average nonrandomness, as measured by the universal test with respect to the output distribution, does not increase. In logarithmic notation: $\log\lambda_x^y 2^{\mathbf{d}_{\Lambda^*\mu}(y)} \stackrel{+}{<} \mathbf{d}_\mu(x)$, or equivalently, $\log\int 2^{\mathbf{d}_{\Lambda^*\mu}(y)}\,\lambda_x(dy) \stackrel{+}{<} \mathbf{d}_\mu(x)$.
Corollary 4.1.12 Let Λ be a computable probability kernel from X1 to X0. There is a constant 푐 such that for every 푥 ∈ X1, and integer 푚 > 0 we have
$$\lambda_x\{y : \mathbf{d}_{\Lambda^*\mu}(y) > \mathbf{d}_\mu(x) + m + c\} \le 2^{-m}.$$
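Indeed, by Markov's inequality,
$$\lambda_x\{y : 2^{\mathbf{d}_{\Lambda^*\mu}(y)} > 2^{\mathbf{d}_\mu(x)+m+c}\} \le 2^{-\mathbf{d}_\mu(x)-m-c}\,\lambda_x^y\,2^{\mathbf{d}_{\Lambda^*\mu}(y)} \le 2^{-m},$$
where $2^c$ absorbs the multiplicative constant of (4.1.1).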
Thus, in a computable random transition, the probability of an increase of randomness deficiency by 푚 units (plus a constant 푐) is less than $2^{-m}$. The constant 푐 comes from the description complexity of the transition Λ.

A randomness conservation result related to Corollary 4.1.10 was proved in [25]. There, the measure over the space $\mathbf{X}_0$ is not the output measure of the transformation, but is assumed to obey certain inequalities related to the transformation.
4.2 Test for a class of measures
4.2.1 From a uniform test

A Bernoulli measure is what we get by tossing a (possibly biased) coin repeatedly.

Definition 4.2.1 Let $X = \mathbb{B}^{\mathbb N}$ be the set of infinite binary sequences, with the usual sequence topology. Let $B_p$ be the measure on 푋 that corresponds to tossing a coin independently with probability 푝 of success: it is called the Bernoulli measure with parameter 푝. Let $\mathcal B$ denote the set of all Bernoulli measures. y

Given a sequence $x \in X$ we may ask the question whether 푥 is random with respect to at least some Bernoulli measure. (It can clearly not be random with respect to two different Bernoulli measures, since if 푥 is random with respect to $B_p$ then its relative frequencies converge to 푝.) This idea suggests two possible definitions for a test of the property of "Bernoulliness":
1. We could define $\mathbf{t}_{\mathcal B}(x) = \inf_{\mu\in\mathcal B}\mathbf{t}_\mu(x)$.
2. We could define the notion of a Bernoulli test as a lower semicomputable function $f(x)$ with the property $B_p f \le 1$ for all 푝.

We will see that in case of this class of measures the two definitions lead to essentially the same test. Let us first extend the definition to more general sets of measures, still having a convenient property.

Definition 4.2.2 (Class tests) Consider a class C of measures that is effectively compact in the sense of Definition B.1.27 or (equivalently for metric spaces) in the sense of Theorem B.1.1. A lower semicomputable function $f(x)$ is called a C-test if for all $\mu\in C$ we have $\mu f \le 1$. It is a universal C-test if it dominates all other C-tests to within a multiplicative constant. y

Example 4.2.3 It is easy to show that the class $\mathcal B$ is effectively compact. One way is to appeal to the general theorem in Proposition B.1.32 saying that applying a computable function to an effectively compact set (in this case the interval $[0,1]$), the image is also an effectively compact set. y

For the case of infinite sequences, we can also define extended tests.

Definition 4.2.4 (Extended class test) Let our space X be the Cantor space of infinite sequences $\Sigma^{\mathbb N}$. Consider a class C of measures that is effectively compact in the sense of Definition B.1.27 or (equivalently for metric spaces) in the sense of Theorem B.1.1. A lower semicomputable function $f : \Sigma^* \to \mathbb{R}_+$ is called an extended C-test if it is monotonic with respect to the prefix relation and for all
$\mu \in C$ and integer $n \ge 0$ we have
$$\sum_{x\in\Sigma^n}\mu(x)f(x) \le 1.$$
It is universal if it dominates all other extended C-tests to within a multiplicative constant. y

The following observation is immediate.

Proposition 4.2.5 A function $f : \Sigma^{\mathbb N} \to \mathbb{R}_+$ is a class test if and only if it can be represented as $\lim_n g(\xi^{\le n})$ where $g(x)$ is an extended class test.

The following theorem defines a universal C-test.
Theorem 4.2.1 Let $\mathbf{t}_\mu(x)$ be a universal uniform test. Then $u(x) = \inf_{\mu\in C}\mathbf{t}_\mu(x)$ defines a universal C-test.
Proof. Let us show first that $u(x)$ is a C-test. It is lower semicomputable according to Proposition B.1.31. Also, for each 휇 we have $\mu u \le \mu\mathbf{t}_\mu \le 1$, showing that $u(x)$ is a C-test.

Let us now show that it is universal. Let $f(x)$ be an arbitrary C-test. By Corollary 4.1.2 there is a uniform test $g_\mu(x)$ such that for all $\mu \in C$ we have $g_\mu(x) = f(x)/2$. It follows from the universality of the uniform test $\mathbf{t}_\mu(x)$ that $f(x) \stackrel{*}{<} g_\mu(x) \stackrel{*}{<} \mathbf{t}_\mu(x)$ for all $\mu \in C$. But then $f(x) \stackrel{*}{<} \inf_{\mu\in C}\mathbf{t}_\mu(x) = u(x)$.

For the case of sequences, the same statement can be made for extended tests. (This is not completely automatic, since a test is obtained from an extended test via a supremum, while a class test is obtained, according to the theorem above, via an infimum.)
4.2.2 Typicality and class tests

The set of Bernoulli measures has an important property shared by many classes considered in practice: namely that random sequences determine the measure to which they belong.

Definition 4.2.6 Consider a class C of measures over a computable metric space $\mathbf{X} = (X, d, D, \alpha)$. We will say that a lower semicomputable function
푠 : 푋 × C → R+ is a separating test for C if
• 푠휇 (·) is a test for each 휇 ∈ C.
• If 휇 ≠ 휈 then 푠휇 (푥) ∨ 푠휈 (푥) = ∞ for all 푥 ∈ 푋.
Given a separating test $s_\mu(x)$ we call an element $x$ typical for $\mu \in \mathcal{C}$ if $s_\mu(x) < \infty$. y

A typical element determines uniquely the measure $\mu$ for which it is typical. Note that if a separating test exists for a class, then any two different measures $\mu_1, \mu_2$ in the class are orthogonal to each other, that is, there are disjoint measurable sets $A_1, A_2$ with $\mu_j(A_j) = 1$. Indeed, let $A_j = \{x : s_{\mu_j}(x) < \infty\}$.

Let us show a nontrivial example: the class $\mathcal{B}$ of Bernoulli measures. Recall that by Chebyshev's inequality we have
$$B_p\Big(\Big\{x \in \mathbb{B}^{\mathbb{N}} : \Big|\sum_{i=1}^{n} x(i) - np\Big| \ge \lambda n^{1/2}(p(1-p))^{1/2}\Big\}\Big) \le \lambda^{-2}.$$
Since $p(1-p) \le 1/4$, this implies
$$B_p\Big(\Big\{x \in \mathbb{B}^{\mathbb{N}} : \Big|\sum_{i=1}^{n} x(i) - np\Big| > \lambda n^{1/2}/2\Big\}\Big) \le \lambda^{-2}.$$
Setting $\lambda = n^{0.1}$ and ignoring the factor $1/2$ gives
$$B_p\Big(\Big\{x \in \mathbb{B}^{\mathbb{N}} : \Big|\sum_{i=1}^{n} x(i) - np\Big| > n^{0.6}\Big\}\Big) \le n^{-0.2}.$$
Setting $n = 2^k$:
$$B_p\Big(\Big\{x \in \mathbb{B}^{\mathbb{N}} : \Big|\sum_{i=1}^{2^k} x(i) - 2^k p\Big| > 2^{0.6k}\Big\}\Big) \le 2^{-0.2k}. \qquad (4.2.1)$$
Now, for the example.

Example 4.2.7 For a sequence $\xi$ in $\mathbb{B}^{\mathbb{N}}$, and for $p \in [0,1]$, let
$$g_p(\xi) = g_{B_p}(\xi) = \sup\Big\{k : \Big|\sum_{i=1}^{2^k} \xi(i) - 2^k p\Big| > 2^{0.6k}\Big\}.$$
Then we have
$$B_p\, g_p \le \sum_k k \cdot 2^{-0.2k} = c < \infty$$
for some constant $c$, so $s_p(\xi) = g_p(\xi)/c$ is a test for each $p$. The property $s_p(\xi) < \infty$ implies that $2^{-k} \sum_{i=1}^{2^k} \xi(i)$ converges to $p$. For a given $\xi$ this is impossible for both $p$ and $q$ with $p \ne q$, hence $s_p(\xi) \vee s_q(\xi) = \infty$. y
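On a finite prefix of $\xi$ only finitely many values of $k$ can be inspected, and taking the maximum over those is exactly how a lower semicomputable function is approximated from below. A Python sketch of the test of Example 4.2.7 (our own rendering; the function name and the use of a pseudorandom sample sequence are just for illustration):

```python
import random

# Finite-prefix approximation of g_p from Example 4.2.7:
#   g_p(xi) = sup{ k : |sum_{i<=2^k} xi(i) - 2^k p| > 2^{0.6 k} },
# restricted to the k with 2^k <= len(prefix). Returns 0 if no violation seen.
def g_p(prefix, p):
    best, k = 0, 1
    while 2 ** k <= len(prefix):
        if abs(sum(prefix[: 2 ** k]) - 2 ** k * p) > 2 ** (0.6 * k):
            best = k
        k += 1
    return best

rng = random.Random(0)
xi = [1 if rng.random() < 0.3 else 0 for _ in range(2 ** 14)]
print(g_p(xi, 0.3))  # stays small: xi is typical for B_0.3
print(g_p(xi, 0.7))  # keeps growing with the prefix: wrong parameter p
```

For the wrong parameter the statistic grows without bound as the prefix lengthens, which is the finite-level trace of $s_p(\xi) \vee s_q(\xi) = \infty$.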
The following structure theorem gives considerable insight.
Theorem 4.2.2 Let C be an effectively compact class of measures, let t휇 (푥) be the universal uniform test and let tC (푥) be a universal class test for C. Assume that a separating test 푠휇 (푥) exists for C. Then we have the representation
$$t_\mu(x) \stackrel{*}{=} t_{\mathcal{C}}(x) \vee s_\mu(x) \quad \text{for all } \mu \in \mathcal{C},\ x \in X.$$
Proof. First, we have $t_{\mathcal{C}}(x) \vee s_\mu(x) \stackrel{*}{<} t_\mu(x)$. Indeed, as we know from the Uniform Extension Corollary 4.1.2, we can extend $s_\mu(x)/2$ to a uniform test, hence $s_\mu(x) \stackrel{*}{<} t_\mu(x)$. Also by definition $t_{\mathcal{C}}(x) \le t_\mu(x)$.

On the other hand, let us show $t_{\mathcal{C}}(x) \vee s_\mu(x) \ge t_\mu(x)$. Suppose first that $x$ is not random with respect to any $\nu \in \mathcal{C}$: then $t_{\mathcal{C}}(x) = \infty$. Suppose now that $x$ is random with respect to some $\nu \in \mathcal{C}$, $\nu \ne \mu$. Then $s_\nu(x) \stackrel{*}{<} t_\nu(x) < \infty$, so the separation property forces $s_\mu(x) = \infty$. Finally, suppose $t_\mu(x) < \infty$. Then $t_\nu(x) = \infty$ for all $\nu \in \mathcal{C}$, $\nu \ne \mu$: otherwise $s_\mu(x)$ and $s_\nu(x)$ would both be finite, contradicting separation. Hence $t_{\mathcal{C}}(x) = \inf_{\nu \in \mathcal{C}} t_\nu(x) = t_\mu(x)$, so the inequality holds again.

The above theorem separates the randomness test into two parts. One part tests randomness with respect to the class $\mathcal{C}$, the other one tests typicality with respect to the measure $\mu$. In the Bernoulli example,
• Part tB (휉) checks “Bernoulliness”, that is independence. It encompasses all the irregularity criteria.
• Part 푠푝 (휉) checks (crudely) for the law of large numbers: whether relative frequency converges (fast) to 푝. If the independence of the sequence is taken for granted, we may assume that the class test is satisfied. What remains is typicality testing, which is similar to ordinary statistical parameter testing. Remarks 4.2.8
1. Separation is the only requirement on the test $s_\mu(x)$: in the Bernoulli case, for example, no matter how crude the convergence criterion expressed by $s_\mu(x)$ is, the maximum $t_{\mathcal{C}}(x) \vee s_\mu(x)$ is always (essentially) the same universal test.

2. Though the convergence criterion can be crude, one still seems to need some kind of constructive convergence of the relative frequencies if the separation test is to be defined in terms of relative frequency convergence. y
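The two-part structure is easy to state operationally. The following minimal Python sketch (with placeholder values standing in for the actual test values, which are not computable) records nothing more than the shape of the decomposition in Theorem 4.2.2:

```python
from math import inf

# t_mu(x) =^* t_C(x) v s_mu(x): randomness for mu within the class C is
# class randomness combined with typicality for the parameter mu.
def t_mu(t_class: float, s_mu: float) -> float:
    return max(t_class, s_mu)

print(t_mu(1.0, 2.0))  # finite: random for the class and typical for mu
print(t_mu(1.0, inf))  # infinite: "Bernoulli", but with a different parameter
print(t_mu(inf, 2.0))  # infinite: not random for any measure in the class
```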
Example 4.2.9 For $\varepsilon > 0$ let $P(\varepsilon)$ be the Markov chain $X_1, X_2, \dots$ with set of states $\{0,1\}$, with transition probabilities $T(0,1) = T(1,0) = \varepsilon$ and $T(0,0) = T(1,1) = 1 - \varepsilon$, and with $P[X_1 = 0] = P[X_1 = 1] = 1/2$. Let $\mathcal{C}(\delta)$ be the class of all $P(\varepsilon)$ with $\varepsilon \ge \delta$. For each $\delta > 0$ this is an effectively compact class, and a separating test is easy to construct, since an effective law of large numbers holds for these Markov chains. We can generalize to the set of $m$-state stationary Markov chains whose eigenvalue gap is $\ge \varepsilon$. y

This example is in contrast to V'yugin's example [56] showing that, in the nonergodic case, in general no recursive speed of convergence can be guaranteed in the Ergodic Theorem (which is the appropriate generalization of the law of large numbers).¹ We can show that if a computable Markov chain is ergodic then its law of large numbers does have a constructive speed of convergence. Hopefully, this observation can be extended to some interesting compact classes of ergodic processes.

¹This paper of V'yugin also shows that though there is no recursive speed of convergence, a certain constructive proof of the pointwise ergodic theorem still gives rise to a test of randomness.
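Returning to the class $\mathcal{C}(\delta)$: the reason a separating test is easy to get is that the empirical flip frequency of a sample path estimates $\varepsilon$, by the law of large numbers for these chains. A Python sketch (our own simulation; all names hypothetical):

```python
import random

# Sample X_1..X_n from the chain P(eps) of Example 4.2.9:
# uniform start, and each step flips the state with probability eps.
def sample_P(eps, n, rng):
    x = [rng.randint(0, 1)]
    for _ in range(n - 1):
        x.append(x[-1] ^ 1 if rng.random() < eps else x[-1])
    return x

# Empirical flip frequency: converges to eps, so it distinguishes P(eps)
# from P(eps') for eps != eps', which is what a separating test needs.
def flip_frequency(x):
    return sum(a != b for a, b in zip(x, x[1:])) / (len(x) - 1)

rng = random.Random(1)
print(flip_frequency(sample_P(0.1, 100_000, rng)))  # close to 0.1
print(flip_frequency(sample_P(0.4, 100_000, rng)))  # close to 0.4
```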
4.2.3 Martin-Löf's approach

Martin-Löf also gave a definition of Bernoulli tests in [38]. For its definition let us introduce the following notation.

Notation 4.2.10 The set of sequences with a given frequency of 1's will be denoted as follows:
$$B(n,k) = \Big\{x \in \mathbb{B}^n : \sum_i x(i) = k\Big\}.$$
y

Martin-Löf's definition translates to the integral constraint version as follows:

Definition 4.2.11 Let $X = \mathbb{B}^{\mathbb{N}}$ be the set of infinite binary sequences with the usual metric. A combinatorial Bernoulli test is a function $f : \mathbb{B}^* \to \overline{\mathbb{R}}_+$ with the following constraints:

a) It is lower semicomputable.

b) It is monotonic with respect to the prefix relation.
c) For all $0 \le k \le n$ we have
$$\sum_{x \in B(n,k)} f(x) \le \binom{n}{k}. \qquad (4.2.2)$$
y

The following observation is useful.

Proposition 4.2.12 If a combinatorial Bernoulli test $f(x)$ is given on strings $x$ of length less than $n$, then extending it to longer strings using monotonicity we get a function that is still a combinatorial Bernoulli test.
Proof. It is sufficient to check the relation (4.2.2). We have (even for $k = 0$, when $B(n-1, k-1) = \emptyset$):
$$\sum_{x \in B(n,k)} f(x) \le \sum_{y \in B(n-1,k-1)} f(y) + \sum_{y \in B(n-1,k)} f(y) \le \binom{n-1}{k-1} + \binom{n-1}{k} = \binom{n}{k}.$$

The following can be shown using standard methods:

Proposition 4.2.13 (Universal combinatorial Bernoulli test) There is a universal combinatorial Bernoulli test $f(x)$, that is, a combinatorial Bernoulli test with the property that for every combinatorial Bernoulli test $h(x)$ there is a constant $c_h > 0$ such that for all $x$ we have $h(x) \le c_h f(x)$.

Definition 4.2.14 Let us fix a universal combinatorial Bernoulli test $b(x)$ and extend it to infinite sequences $\xi$ by
$$b(\xi) = \sup_n b(\xi^{\le n}).$$
Let $t_{\mathcal{B}}(\xi)$ be a universal class test for Bernoulli measures, for $\xi \in \mathbb{B}^{\mathbb{N}}$. y

Let us show that the general class test for Bernoulli measures and Martin-Löf's Bernoulli test yield the same random sequences.

Theorem 4.2.3 With the above definitions, we have $b(\xi) \stackrel{*}{=} t_{\mathcal{B}}(\xi)$. In words: a sequence is nonrandom with respect to all Bernoulli measures if and only if it is rejected by a universal combinatorial Bernoulli test.
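Before turning to the proof, note that the constraint (4.2.2) is finite and checkable for small $n$. The following brute-force Python sketch (a toy candidate of our own devising, not Martin-Löf's universal test $b$) makes the definition concrete:

```python
from itertools import combinations
from math import comb

# Check constraint (4.2.2) at level n: for every k, the sum of f over
# B(n, k), the strings of length n with exactly k ones, is at most C(n, k).
def satisfies_422(f, n):
    for k in range(n + 1):
        total = sum(
            f(tuple(1 if i in ones else 0 for i in range(n)))
            for ones in combinations(range(n), k)
        )
        if total > comb(n, k):
            return False
    return True

# Toy candidate: score a string by its longest constant run.
def longest_run(x):
    best = run = 1
    for a, b in zip(x, x[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return float(best)

print(satisfies_422(longest_run, 10))
# False: already at k = 0 the single all-zeros string gets score 10 > C(10,0) = 1,
# so this candidate would have to be scaled down to qualify as a test.
```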
Proof. We first show $b(\xi) \stackrel{*}{<} t_{\mathcal{B}}(\xi)$. Moreover, we show that $b(x)$ is an extended class test for the class of Bernoulli measures. We only need to check the sum condition, namely that for all $0 \le p \le 1$ and all $n > 0$ the inequality $\sum_{x \in \mathbb{B}^n} B_p(x) b(x) \le 1$ holds. Indeed, we have
$$\sum_{x \in \mathbb{B}^n} B_p(x)\, b(x) = \sum_{k=0}^{n} p^k (1-p)^{n-k} \sum_{x \in B(n,k)} b(x) \le \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1.$$

On the other hand, let $f(x) = t_{\mathcal{B}}(x)$, $x \in \mathbb{B}^*$, be the extended test for $t_{\mathcal{B}}(\xi)$. For all integers $N > 0$ let $n = \lfloor \sqrt{N/2} \rfloor$. Then as $N$ runs through the integers, $n$ also runs through all integers. For $x \in \mathbb{B}^N$ let $F(x) = f(x^{\le n})$. Since $f(x)$ is lower semicomputable and monotonic with respect to the prefix relation, this is also true of $F(x)$.

We need to estimate $\sum_{x \in B(N,K)} F(x)$. For this, note that for $y \in B(n,k)$ we have
$$|\{x \in B(N,K) : y \sqsubseteq x\}| = \binom{N-n}{K-k}. \qquad (4.2.3)$$
Now for $0 \le K \le N$ we have
$$\sum_{x \in B(N,K)} F(x) = \sum_{y \in \mathbb{B}^n} f(y)\, |\{x \in B(N,K) : y \sqsubseteq x\}| = \sum_{k=0}^{n} \binom{N-n}{K-k} \sum_{y \in B(n,k)} f(y). \qquad (4.2.4)$$