Loss Bounds and Time Complexity for Speed Priors
Daniel Filan, Jan Leike, Marcus Hutter
Research School of Computer Science, Australian National University

(Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.)

Abstract

This paper establishes for the first time the predictive performance of speed priors and their computational complexity. A speed prior is essentially a probability distribution that puts low probability on strings that are not efficiently computable. We propose a variant of the original speed prior (Schmidhuber, 2002), and show that our prior can predict sequences drawn from probability measures that are estimable in polynomial time. Our speed prior is computable in doubly-exponential time, but not in polynomial time. On a polynomial time computable sequence our speed prior is computable in exponential time. We show better upper complexity bounds for Schmidhuber's speed prior under the same conditions, and that it predicts deterministic sequences that are computable in polynomial time; however, we also show that it is not computable in polynomial time, and the question of its predictive properties for stochastic sequences remains open.

1 Introduction

We consider the general problem of sequence prediction, where a sequence of symbols $x_1, x_2, \dots, x_{t-1}$ is drawn from an unknown computable distribution $\mu$, and the task is to predict the next symbol $x_t$. If $\mu$ belongs to some known countable class of distributions, then a Bayesian mixture over the class leads to good loss bounds: the expected loss is at most $L + O(\sqrt{L})$, where $L$ is the loss of the informed predictor that knows $\mu$ (Hutter, 2005, Thm. 3.48). These bounds are known to be tight.

Solomonoff's theory of inductive inference handles the most general case where all we know about $\mu$ is that it is computable (Solomonoff, 1964, 1978). The Solomonoff prior $M$ assigns to a string $x$ the probability that a universal Turing machine prints something starting with $x$ when fed with fair coin flips. Equivalently, the distribution $M$ can be seen as a Bayesian mixture that weighs each distribution according to its Kolmogorov complexity (Wood et al., 2013), assigning higher a priori probability to simpler hypotheses (Hutter, 2007). However, $M$ is incomputable (Leike and Hutter, 2015), which has thus far limited its application.

Schmidhuber has proposed a computable alternative to $M$ which discounts strings that are not efficiently computable (Schmidhuber, 2002). This distribution is called the speed prior because asymptotically only the computationally fastest distributions that explain the data contribute to the mixture. However, no loss bounds for Schmidhuber's prior, which we write as $S_{Fast}$, are known except in the case where the data are drawn from a prior like Schmidhuber's.

We introduce a prior $S_{Kt}$ that is related to both $S_{Fast}$ and $M$, and establish in Section 3 that it is also a speed prior in Schmidhuber's sense. Our first main contribution is a bound on the loss incurred by an $S_{Kt}$-based predictor when predicting strings drawn from a distribution that is computable in polynomial time. This is proved in Section 4. The bounds we get are only a logarithmic factor worse than the bounds for the Solomonoff predictor. In particular, if the measure is deterministic and the loss function penalises errors, $S_{Kt}$-based prediction will only make a logarithmic number of errors. Therefore, $S_{Kt}$ is able to effectively learn the generating distribution $\mu$. Our second main contribution is a proof that the same bound holds for the loss incurred by an $S_{Fast}$-based predictor when predicting a string deterministically generated in polynomial time, shown in the same section.

In Section 5 we discuss the time complexity of $S_{Kt}$ and $S_{Fast}$. We show that $S_{Fast}$ is computable in exponential time while $S_{Kt}$ is computable in doubly-exponential time, but not in polynomial time, limiting its practical applicability. However, we also show that if we are predicting a sequence that is computable in polynomial time, it only takes polynomial time to compute $S_{Fast}$ and exponential time to compute $S_{Kt}$.

Although the results of this paper are theoretical and the algorithms impractical-seeming, related ideas from the field of algorithmic information theory have been approximated and put into practice. Examples include the Universal Similarity Metric's use in clustering (Cilibrasi and Vitányi, 2005), Solomonoff-based reinforcement learning (Veness et al., 2011), and the Levin search-inspired Optimal Ordered Problem Solver (Schmidhuber, 2004). However, using the theory to devise practical applications is a non-trivial task that we leave for future work.

2 Preliminaries

2.1 Setup and notation

Throughout this paper, we use monotone Turing machines with a binary alphabet $\mathbb{B} = \{0, 1\}$, although all results generalise to arbitrary finite alphabets. A monotone machine is one with a unidirectional read-only input tape where the head can only move one way, a unidirectional write-only output tape where the head can only move one way, and some bidirectional work tapes. We say that a monotone machine $T$ computes string $x$ given program $p$ if the machine prints $x$ after reading all of $p$ but no more, and write $p \xrightarrow{T} x$ (Li and Vitányi, 2008, Def. 4.5.2). Some of these machines are universal Turing machines, or 'UTM's. A UTM can simulate all other machines, so that the output of $U$ given input $I(T)p$ (where $I(T)$ is a prefix-free coding[1] of a Turing machine $T$) is the same as the output of $T$ given input $p$. Furthermore, we may assume this simulation occurs with only polynomial time overhead. In this paper, we fix a 'reference' UTM $U$, and whenever a function $f(T, \dots)$ takes an argument $T$ that is a Turing machine, we will often write $f(\dots)$, where we set $T$ to be the reference UTM.

[1] A coding such that for no two different machines $T$ and $T'$ is $I(T)$ a prefix of $I(T')$.

Our notation is fairly standard, with a few exceptions. If $p \xrightarrow{U} x$, then we simply write $p \to x$. We write $f(n) \leq^{\times} g(n)$ if $f(n) = O(g(n))$, and $f(n) =^{\times} g(n)$ if $f(n) \leq^{\times} g(n)$ and $g(n) \leq^{\times} f(n)$. Also, if $x$ is some string, we denote the length of $x$ by $|x|$. We write the set of finite binary strings as $\mathbb{B}^*$, the set of infinite binary sequences as $\mathbb{B}^\infty$, an element of $\mathbb{B}^\infty$ as $x_{1:\infty}$, the $n$th symbol of a string $x$ or $x_{1:\infty}$ as $x_n$, and the first $n$ symbols of any string $x$ or $x_{1:\infty}$ as $x_{1:n}$. $\#A$ is the cardinality of set $A$. Finally, we write $x \sqsubseteq y$ if string $x$ is a prefix of string $y$, and $x \sqsubset y$ if $x$ is a proper prefix of $y$.

2.2 $S_{Fast}$ and $M$

To define $S_{Fast}$, we first need to define the fast algorithm (called search by Li and Vitányi, 2008, Ch. 7.5) after which it is named. This algorithm performs phase $i$ for each $i \in \mathbb{N}$, whereby $2^{i-|p|}$ instructions of all programs satisfying $|p| \leq i$ are executed as they would be on $U$, and the outputs are printed sequentially, separated by blanks. If string $x$ is computed by program $p$ in phase $i$, then we write $p \to_i x$. Then, $S_{Fast}$ is defined as

$$S_{Fast}(x) := \sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x} 2^{-|p|} \qquad (1)$$

This algorithm is inspired by the Kt complexity of a string, defined as

$$Kt(x) = \min_p \{ |p| + \log t(U, p, x) \}$$

where $t(U, p, x)$ is the time taken for program $p$ to compute $x$ on the UTM $U$, and if program $p$ never computes $x$, we set $t(U, p, x) := \infty$ (Li and Vitányi, 2008, Def. 7.5.1). If we define the Kt-cost of a computation of a string $x$ by program $p$ as the minimand of Kt, that is,

$$\text{Kt-cost}(p, x) := |p| + \log t(p, x)$$

then we can see that program $p$ computes string $x$ in phase $i$ of fast iff $\text{Kt-cost}(p, x) \leq i$. As such, $S_{Fast}$ gives low probability to strings of high Kt complexity.

Similarly to the above, the monotone Kolmogorov complexity of $x$ is defined as

$$Km(x) = \min_p \{ |p| : p \to x \}$$

If we define the minimand of Km as

$$\text{Km-cost}(p, x) := \begin{cases} |p| & \text{if } p \to x \\ \infty & \text{otherwise} \end{cases}$$

then the Solomonoff prior $M(x) = \sum_{p \to x} 2^{-|p|}$ can be written as $\sum_{p \to x} 2^{-\text{Km-cost}(p, x)}$. $M$ and $S_{Fast}$ are both semimeasures, but not measures:

Definition 1. A semimeasure is a function $\nu : \mathbb{B}^* \to [0, \infty)$ such that $\nu(\epsilon) \leq 1$ and $\nu(x) \geq \nu(x0) + \nu(x1)$ for all $x \in \mathbb{B}^*$. If $\nu$ satisfies these with equality, we call $\nu$ a measure.

Semimeasures can be used for prediction:

Definition 2. If $\nu$ is a semimeasure, the $\nu$-probability of $x_t$ given $x_{<t}$ is $\nu(x_t \mid x_{<t}) := \nu(x_{1:t}) / \nu(x_{<t})$.

3 Speed priors

By analogy to $M$, we can define a variant of the Solomonoff prior that penalises strings of high Kt complexity more directly than $S_{Fast}$ does:

$$S_{Kt}(x) := \sum_{p \to x} 2^{-\text{Kt-cost}(p, x)} = \sum_{p \to x} \frac{2^{-|p|}}{t(p, x)} \qquad (2)$$

$S_{Kt}$ is a semimeasure, but is not a measure.

Proposition 5. $\sum_{x \in \mathcal{C}_t} S_{Kt}(x) \leq \frac{1}{t}$
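To make the bookkeeping in the definitions above concrete, the following is a minimal Python sketch (not from the paper). The "machine" is a toy stand-in for the reference UTM — each program simply prints its own bits, taking two steps per output bit — so the numbers only illustrate how fast's phase budgets and the weights in (1) and (2) combine into finite truncations of $S_{Fast}$ and $S_{Kt}$, not real values of either prior.

```python
# Toy illustration of the `fast` algorithm's phase/budget bookkeeping and of
# truncated sums for S_Fast (eq. 1) and S_Kt (eq. 2).  The "UTM" is a toy
# stand-in: program p just prints its own bits, two steps per bit.

from itertools import product

def run(p, budget):
    """Toy monotone machine: p prints itself, two steps per output bit.
    Returns (output, total_steps) with total_steps = None if p did not
    finish within `budget` steps."""
    bits_printed = min(len(p), budget // 2)
    out = p[:bits_printed]
    steps = 2 * len(p) if bits_printed == len(p) else None
    return out, steps

def fast_phases(max_phase):
    """Phases i = 1..max_phase: in phase i every program p with |p| <= i
    gets 2**(i - |p|) steps.  Yield (i, p, output, steps) when p halts."""
    for i in range(1, max_phase + 1):
        for length in range(1, i + 1):
            for bits in product("01", repeat=length):
                p = "".join(bits)
                out, steps = run(p, 2 ** (i - len(p)))
                if steps is not None:
                    yield i, p, out, steps

def s_fast_truncated(x, max_phase):
    """Finite truncation of eq. (1): sum_i 2^-i * sum_{p ->_i x} 2^-|p|."""
    total = 0.0
    for i, p, out, _ in fast_phases(max_phase):
        if out.startswith(x):      # p computes a string starting with x
            total += 2 ** (-i) * 2 ** (-len(p))
    return total

def s_kt_truncated(x, max_len):
    """Finite truncation of eq. (2): sum over programs up to length
    max_len of 2^-|p| / t(p, x)."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            out, steps = run(p, 10 ** 9)   # generous budget: let p finish
            if steps is not None and out.startswith(x):
                total += 2 ** (-len(p)) / steps
    return total
```

With this toy machine, `s_kt_truncated("0", 1)` evaluates to $2^{-1}/2 = 0.25$, and the truncated $S_{Fast}$ sum grows monotonically as more phases are included, mirroring the convergence of the infinite sum in (1).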