Loss Bounds and Time Complexity for Speed Priors

Daniel Filan, Jan Leike, Marcus Hutter
Research School of Computer Science, Australian National University

Abstract

This paper establishes for the first time the predictive performance of speed priors and their computational complexity. A speed prior is essentially a probability distribution that puts low probability on strings that are not efficiently computable. We propose a variant of the original speed prior (Schmidhuber, 2002), and show that our prior can predict sequences drawn from probability measures that are estimable in polynomial time. Our speed prior is computable in doubly-exponential time, but not in polynomial time. On a polynomial time computable sequence our speed prior is computable in exponential time. We show better upper complexity bounds for Schmidhuber's speed prior under the same conditions, and that it predicts deterministic sequences that are computable in polynomial time; however, we also show that it is not computable in polynomial time, and the question of its predictive properties for stochastic sequences remains open.

1 Introduction

We consider the general problem of sequence prediction, where a sequence of symbols $x_1, x_2, \ldots, x_{t-1}$ is drawn from an unknown computable distribution $\mu$, and the task is to predict the next symbol $x_t$. If $\mu$ belongs to some known countable class of distributions, then a Bayesian mixture over the class leads to good loss bounds: the expected loss is at most $L + O(\sqrt{L})$, where $L$ is the loss of the informed predictor that knows $\mu$ (Hutter, 2005, Thm. 3.48). These bounds are known to be tight.

Solomonoff's theory of inductive inference handles the most general case where all we know about $\mu$ is that it is computable (Solomonoff, 1964, 1978). The Solomonoff prior $M$ assigns to a string $x$ the probability that a universal Turing machine prints something starting with $x$ when fed with fair coin flips. Equivalently, the distribution $M$ can be seen as a Bayesian mixture that weighs each distribution according to its Kolmogorov complexity (Wood et al., 2013), assigning higher a priori probability to simpler hypotheses (Hutter, 2007). However, $M$ is incomputable (Leike and Hutter, 2015), which has thus far limited its application.

Schmidhuber has proposed a computable alternative to $M$ which discounts strings that are not efficiently computable (Schmidhuber, 2002). This distribution is called the speed prior because asymptotically only the computationally fastest distributions that explain the data contribute to the mixture. However, no loss bounds for Schmidhuber's prior, which we write as $S_{Fast}$, are known except in the case where the data are drawn from a prior like Schmidhuber's.

We introduce a prior $S_{Kt}$ that is related to both $S_{Fast}$ and $M$, and establish in Section 3 that it is also a speed prior in Schmidhuber's sense. Our first main contribution is a bound on the loss incurred by a $S_{Kt}$-based predictor when predicting strings drawn from a distribution that is computable in polynomial time. This is proved in Section 4. The bounds we get are only a logarithmic factor worse than the bounds for the Solomonoff predictor. In particular, if the measure is deterministic and the loss function penalises errors, $S_{Kt}$-based prediction will only make a logarithmic number of errors. Therefore, $S_{Kt}$ is able to effectively learn the generating distribution $\mu$. Our second main contribution is a proof that the same bound holds for the loss incurred by a $S_{Fast}$-based predictor when predicting a string deterministically generated in polynomial time, shown in the same section.

In Section 5 we discuss the time complexity of $S_{Kt}$ and $S_{Fast}$. We show that $S_{Fast}$ is computable in exponential time while $S_{Kt}$ is computable in doubly-exponential time, but not in polynomial time, limiting its practical applicability. However, we also show that if we are predicting a sequence that is computable in polynomial time, it only takes polynomial time to compute $S_{Fast}$ and exponential time to compute $S_{Kt}$.


Although the results of this paper are theoretical and the algorithms impractical-seeming, related ideas from the field of algorithmic information theory have been approximated and put into practice. Examples include the Universal Similarity Metric's use in clustering (Cilibrasi and Vitanyi, 2005), Solomonoff-based reinforcement learning (Veness et al., 2011), and the Levin search-inspired Optimal Ordered Problem Solver (Schmidhuber, 2004). However, using the theory to devise practical applications is a non-trivial task that we leave for future work.

2 Preliminaries

2.1 Setup and notation

Throughout this paper, we use monotone Turing machines with a binary alphabet $\mathbb{B} = \{0, 1\}$, although all results generalise to arbitrary finite alphabets. A monotone machine is one with a unidirectional read-only input tape where the head can only move one way, a unidirectional write-only output tape where the head can only move one way, and some bidirectional work tapes. We say that a monotone machine $T$ computes string $x$ given program $p$ if the machine prints $x$ after reading all of $p$ but no more, and write $p \to_T x$ (Li and Vitányi, 2008, Def. 4.5.2). Some of these machines are universal Turing machines, or 'UTM's. A UTM can simulate all other machines, so that the output of $U$ given input $I(T)p$ (where $I(T)$ is a prefix-free coding^1 of a Turing machine $T$) is the same as the output of $T$ given input $p$. Furthermore, we may assume this simulation occurs with only polynomial time overhead. In this paper, we fix a 'reference' UTM $U$, and whenever a function $f(T, \ldots)$ takes an argument $T$ that is a Turing machine, we will often write $f(\ldots)$, where we set $T$ to be the reference UTM.

^1 A coding such that for no two different machines $T$ and $T'$ is $I(T)$ a prefix of $I(T')$.

Our notation is fairly standard, with a few exceptions. If $p \to_U x$, then we simply write $p \to x$. We write $f(n) \leq^\times g(n)$ if $f(n) = O(g(n))$, and $f(n) =^\times g(n)$ if $f(n) \leq^\times g(n)$ and $g(n) \leq^\times f(n)$. Also, if $x$ is some string, we denote the length of $x$ by $|x|$. We write the set of finite binary strings as $\mathbb{B}^*$, the set of infinite binary sequences as $\mathbb{B}^\infty$, an element of $\mathbb{B}^\infty$ as $x_{1:\infty}$, the $n$th symbol of a string $x$ or $x_{1:\infty}$ as $x_n$, and the first $n$ symbols of any string $x$ or $x_{1:\infty}$ as $x_{1:n}$. $\#A$ is the cardinality of the set $A$.

2.2 $S_{Fast}$ and $M$

To define $S_{Fast}$, we first need to define the fast algorithm (called search by Li and Vitányi (2008), Ch. 7.5) after which it is named. This algorithm performs phase $i$ for each $i \in \mathbb{N}$, whereby $2^{i-|p|}$ instructions of all programs satisfying $|p| \leq i$ are executed as they would be on $U$, and the outputs are printed sequentially, separated by blanks. If string $x$ is computed by program $p$ in phase $i$, then we write $p \to_i x$. Then, $S_{Fast}$ is defined as

$$S_{Fast}(x) := \sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x} 2^{-|p|} \qquad (1)$$

This algorithm is inspired by the Kt complexity of a string, defined as

$$Kt(x) = \min_p \{|p| + \log t(U, p, x)\}$$

where $t(U, p, x)$ is the time taken for program $p$ to compute $x$ on the UTM $U$, and if program $p$ never computes $x$, we set $t(U, p, x) := \infty$ (Li and Vitányi, 2008, Def. 7.5.1). If we define the Kt-cost of a computation of a string $x$ by program $p$ as the minimand of $Kt$, that is,

$$\text{Kt-cost}(p, x) := |p| + \log t(p, x)$$

then we can see that program $p$ computes string $x$ in phase $i$ of fast iff $\text{Kt-cost}(p, x) \leq i$. As such, $S_{Fast}$ gives low probability to strings of high Kt complexity.

Similarly to the above, the monotone Kolmogorov complexity of $x$ is defined as

$$Km(x) = \min_p \{|p| : p \to x\}$$

If we define the minimand of $Km$ as

$$\text{Km-cost}(p, x) := \begin{cases} |p| & \text{if } p \to x \\ \infty & \text{otherwise} \end{cases}$$

then the Solomonoff prior $M(x) = \sum_{p \to x} 2^{-|p|}$ can be written as $\sum_{p \to x} 2^{-\text{Km-cost}(p,x)}$. $M$ and $S_{Fast}$ are both semimeasures, but not measures:

Definition 1. A semimeasure is a function $\nu : \mathbb{B}^* \to [0, \infty)$ such that $\nu(\epsilon) \leq 1$ and $\nu(x) \geq \nu(x0) + \nu(x1)$ for all $x \in \mathbb{B}^*$. If $\nu$ satisfies these with equality, we call $\nu$ a measure.

Semimeasures can be used for prediction:

Definition 2. If $\nu$ is a semimeasure, the $\nu$-probability of $x_t$ given $x_{<t}$ is $\nu(x_t \mid x_{<t}) := \nu(x_{1:t}) / \nu(x_{<t})$.
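To make the phase structure of fast concrete, the following sketch accumulates the contributions to $S_{Fast}(x)$ from the first $k$ phases of equation (1). The helper run_prefix, which runs a program on the reference UTM for a bounded number of steps and returns whatever output it has printed so far, is an assumed interface and not part of the paper, so this is an illustration of the definition rather than a usable implementation.

```python
from fractions import Fraction

def s_fast_truncated(x: str, run_prefix, k: int) -> Fraction:
    """Contribution to S_Fast(x) from the first k phases of `fast`
    (equation (1)).  `run_prefix(p, steps)` is an assumed stand-in for
    the reference UTM: it runs program p (a bit-string) for at most
    `steps` steps and returns the output produced so far."""
    total = Fraction(0)
    for i in range(1, k + 1):              # phase i of fast
        for length in range(1, i + 1):     # all programs with |p| <= i
            for n in range(2 ** length):
                p = format(n, "0{}b".format(length))
                out = run_prefix(p, 2 ** (i - length))  # 2^(i-|p|) steps
                if out.startswith(x):      # p ->_i x: x printed in phase i
                    total += Fraction(1, 2 ** (i + length))  # 2^-i * 2^-|p|
    return total
```

Running the first $k$ phases costs $\sum_{i \leq k} i 2^i$ steps in total, which is the count that drives the complexity bounds of Section 5.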


3 The speed prior $S_{Kt}$

By analogy to $M$, we can define a variant of the Solomonoff prior that penalises strings of high Kt complexity more directly than $S_{Fast}$ does:

$$S_{Kt}(x) := \sum_{p \to x} 2^{-\text{Kt-cost}(p,x)} = \sum_{p \to x} \frac{2^{-|p|}}{t(p, x)} \qquad (2)$$

$S_{Kt}$ is a semimeasure, but is not a measure.

3.1 Similar definitions for $S_{Fast}$ and $S_{Kt}$

The definitions (1) of $S_{Fast}$ and (2) of $S_{Kt}$ have been given in different forms: the first in terms of phases of fast, and the second in terms of Kt-cost. In this subsection, we show that each can be rewritten in a form similar to the other's definition, which sheds light on the differences and similarities between the two.

Proposition 3.
$$S_{Fast}(x) =^\times \sum_{p \to x} \frac{2^{-2|p|}}{t(p, x)}$$

Proof. Proof contained in supplementary material.

Proposition 4.
$$S_{Kt}(x) =^\times \sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x} 1$$

Proof. Proof contained in supplementary material.

3.2 $S_{Kt}$ is a speed prior

Although we have defined $S_{Kt}$, we have not shown any results that indicate it deserves to be called a speed prior. Two key properties of $S_{Fast}$ justify its description as a speed prior: firstly, that the cumulative prior probability measure of all $x$ incomputable in time $t$ is at most inversely proportional to $t$; and secondly, that if $x_{1:\infty} \in \mathbb{B}^\infty$, and program $p \in \mathbb{B}^*$ computes $x_{1:n}$ within at most $f(n)$ steps, then the contribution to $S_{Fast}(x_{1:n})$ by programs that take time much longer than $f(n)$ vanishes as $n \to \infty$ (Schmidhuber, 2002). In this subsection, we prove that both of these properties also hold for $S_{Kt}$. $S_{Fast}$ and $S_{Kt}$ are the only distributions that the authors are aware of that satisfy these two properties.

Let $\mathcal{C}_t$ denote the set of strings $x$ that are incomputable in time $t$ (that is, there is no program $p$ such that $p \to x$ in $t$ or fewer timesteps) such that for any $y \sqsubset x$, the prefix $y$ is computable in time $t$. By definition, all strings that are incomputable in time $t$ have as a prefix an element of $\mathcal{C}_t$, and $\mathcal{C}_t$ is a prefix-free set^2 (by construction). Furthermore, the probability measure of all strings incomputable in time $t$ is simply the sum of the probabilities of all elements of $\mathcal{C}_t$.

^2 That is, a set such that no element is a prefix of another element.

Proposition 5.
$$\sum_{x \in \mathcal{C}_t} S_{Kt}(x) \leq \frac{1}{t}$$

Proof.
$$\sum_{x \in \mathcal{C}_t} S_{Kt}(x) = \sum_{x \in \mathcal{C}_t} \sum_{p \to x} \frac{2^{-|p|}}{t(p, x)} \leq \frac{1}{t} \sum_{x \in \mathcal{C}_t} \sum_{p \to x} 2^{-|p|} \leq \frac{1}{t}$$

by the Kraft inequality, since the fact that $\mathcal{C}_t$ is a prefix-free set guarantees that the set of programs that compute elements of $\mathcal{C}_t$ is also prefix-free, due to our use of monotone machines.

Proposition 6. Let $x_{1:\infty} \in \mathbb{B}^\infty$ be such that there exists a program $p^x \in \mathbb{B}^*$ which outputs $x_{1:n}$ in $f(n)$ steps for all $n \in \mathbb{N}$. Let $g(n)$ grow faster than $f(n)$, i.e. $\lim_{n \to \infty} f(n)/g(n) = 0$. Then,

$$\lim_{n \to \infty} \frac{\sum_{p \to_{\geq g(n)} x_{1:n}} 2^{-|p|} / t(p, x_{1:n})}{\sum_{p \to_{\leq f(n)} x_{1:n}} 2^{-|p|} / t(p, x_{1:n})} = 0$$

where $p \to_{\leq t} x$ iff program $p$ computes string $x$ in no more than $t$ steps, and $p \to_{\geq t} x$ is defined analogously.

An informal statement of this proposition is that contributions to $S_{Kt}(x_{1:n})$ by programs that take time longer than $g(n)$ steps to run are dwarfed by those by programs that take less than $f(n)$ steps to run. Therefore, asymptotically, only the fastest programs contribute to $S_{Kt}$.

Proof. Proof contained in supplementary material.

4 Loss bounds

In this section, we prove a performance bound on $S_{Kt}$-based sequence prediction, when predicting a sequence drawn from a measure that is estimable in polynomial time. We also prove a similar bound on $S_{Fast}$-based sequence prediction when predicting deterministic sequences computable in polynomial time.

For the purpose of this section, we write $S_{Kt}$ somewhat more explicitly as

$$S_{Kt}(x) = \sum_{p \to_U x} \frac{2^{-|p|}}{t(U, p, x)}$$

and give some auxiliary definitions. Let $\langle \cdot \rangle_{\mathbb{B}^*}$ be a prefix-free coding of the strings of finite length and $\langle \cdot \rangle_{\mathbb{N}}$ be a prefix-free coding of the integers, where both of these prefix-free codings are computable and decodable in polynomial time.
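A standard instance of such a coding for the integers is the Elias gamma code; this particular choice is our illustration, since the paper only requires that some polynomial-time prefix-free coding exists.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of n >= 1: |bin(n)| - 1 zeros, then bin(n).
    The code is prefix-free, and encoding is linear in the output length."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str) -> tuple[int, str]:
    """Decode one gamma codeword from the front of `bits`; return the
    integer and the unread remainder, so codewords concatenate freely."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    codeword, rest = bits[: 2 * zeros + 1], bits[2 * zeros + 1 :]
    return int(codeword[zeros:], 2), rest

# Example: <5>_N <3>_N concatenated, then decoded back in order.
stream = gamma_encode(5) + gamma_encode(3)   # "00101" + "011"
m, rest = gamma_decode(stream)               # m = 5
n, _ = gamma_decode(rest)                    # n = 3
```

A coding $\langle \cdot \rangle_{\mathbb{B}^*}$ of finite strings can then be obtained, for example, by prefixing $x$ with a gamma-coded copy of $|x|$.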


Definition 7. A function $f : \mathbb{B}^* \to \mathbb{R}$ is finitely computable if there is some Turing machine $T_f$ that when given input $x \in \mathbb{B}^*$ prints $\langle m \rangle_{\mathbb{N}} \langle n \rangle_{\mathbb{N}}$ and then halts, where $f(x) = m/n$. The function $f$ is finitely computable in polynomial time if it takes $T_f$ at most $p(|x|)$ timesteps to halt on input $x$, where $p$ is a polynomial.
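As a toy witness of Definition 7 (our illustration, not a construction from the paper), the uniform measure $\lambda(x) = 2^{-|x|}$ is finitely computable in polynomial time:

```python
from fractions import Fraction

def T_lambda(x: str) -> tuple[int, int]:
    """A machine witnessing Definition 7 for the uniform measure
    lambda(x) = 2^-|x|: output numerator m = 1 and denominator
    n = 2^|x|.  Writing n in binary takes |x| + 1 bits, so printing
    <m>_N <n>_N and halting takes time polynomial in |x|."""
    return 1, 2 ** len(x)

assert Fraction(*T_lambda("0110")) == Fraction(1, 16)
```

Since $\lambda$ is finitely computable in polynomial time, it is in particular estimable in polynomial time by itself, in the sense of the next definition.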

Definition 8. Let $f, g : \mathbb{B}^* \to \mathbb{R}$. $g$ is estimable in polynomial time by $f$ if $f$ is finitely computable in polynomial time and $f(x) =^\times g(x)$. The function $g$ is estimable in polynomial time if it is estimable in polynomial time by some function $f$.

First, note that this definition is reasonably weak, since we only require $f(x) =^\times g(x)$, rather than $f(x) = g(x)$. Also note that if $f$ is finitely computable in polynomial time, it is estimable in polynomial time by itself. For a measure $\mu$, estimability in polynomial time captures our intuitive notion of efficient computability: we only need to know $\mu$ up to a constant factor for prediction, and we can find this out in polynomial time.

We consider a prediction setup where a predictor outputs a prediction, and then receives some loss depending on the predicted next bit and the correct next bit. More formally, we have some loss function $\ell(x_t, y_t) \in [0, 1]$ defined for all $x_t, y_t \in \mathbb{B}$ and all $t \in \mathbb{N}$, representing the loss incurred for a prediction of $y_t$ when the actual next bit is $x_t$, which the predictor observes after prediction. One example of such a loss function is the 0-1 loss, which assigns 0 to a correct prediction and 1 to an incorrect prediction, although there are many others.

We define the $\Lambda_\rho$ predictor to be the predictor which minimises $\rho$-expected loss, outputting $y_t := \operatorname{argmin}_{y_t} \sum_{x_t} \rho(x_t \mid x_{1:t-1}) \ell(x_t, y_t)$ at time $t$. If the true distribution is $\mu$, we judge a predictor $\Lambda$ by its total $\mu$-expected loss in the first $n$ steps:

$$L_{n\mu}^{\Lambda} := \mathbb{E}_\mu \left[ \sum_{t=1}^{n} \ell(x_t, y_t^\Lambda) \right]$$

In particular, if we are using 0-1 loss, $L_{n\mu}^{\Lambda}$ is the expected number of errors made by $\Lambda$ up to time $n$ in the environment $\mu$.

Theorem 9 (Bound on $S_{Kt}$ prediction loss). If $\mu$ is a measure that is estimable in polynomial time by some semimeasure $\nu$, and $x$ is a sequence sampled from $\mu$, then the expected loss incurred by the $\Lambda_{S_{Kt}}$ predictor is bounded by^3

$$L_{n\mu}^{\Lambda_{S_{Kt}}} - L_{n\mu}^{\Lambda_\mu} \leq 2 D_n + 2 \sqrt{L_{n\mu}^{\Lambda_\mu} D_n}$$

where $D_n = O(\log n)$.

^3 A similar bound that can be proved the same way is $\sqrt{L_{n\mu}^{\Lambda_{S_{Kt}}}} - \sqrt{L_{n\mu}^{\Lambda_\mu}} \leq \sqrt{2 D_n}$ for the same $D_n$ (Hutter, 2007, Eq. 8, 5).

Since $L_{n\mu}^{\Lambda_\mu} \leq n$, this means that $\Lambda_{S_{Kt}}$ only incurs at most $O(\sqrt{n \log n})$ extra loss in expectation, although this bound will be much tighter in more structured environments where $\Lambda_\mu$ makes few errors, such as deterministic environments.

In order to prove this theorem, we use the following lemma:

Lemma 10. Let $\nu$ be a semimeasure that is finitely computable in polynomial time. There exists a Turing machine $T_\nu$ such that for all $x \in \mathbb{B}^*$

$$\nu(x) = \sum_{p \to_{T_\nu} x} 2^{-|p|} \qquad (3)$$

and

$$2^{-Km_{T_\nu}(x)} \geq \nu(x)/4 \qquad (4)$$

where $Km_{T_\nu}(x)$ is the length of the shortest program for $x$ on $T_\nu$.^4

^4 Note that this lemma would be false if we were to let $\nu$ be an arbitrary lower-semicomputable semimeasure, since if $\nu = M$, this would imply that $2^{-Km(x)} =^\times M(x)$, which was disproved by Gács (1983).

Note that a proof already exists that there is some machine $T_\nu$ such that (3) holds (Li and Vitányi, 2008, Thm. 4.5.2), but it does not prove (4), and we wish to understand the operation of $T_\nu$ in order to prove Theorem 9.

Proof of Lemma 10. The machine $T_\nu$ is essentially a decoder of an algorithmic coding scheme with respect to $\nu$. It uses the natural correspondence between $\mathbb{B}^\infty$ and $[0, 1]$, associating a binary string $x_1 x_2 x_3 \cdots$ with the real number $0.x_1 x_2 x_3 \cdots$. It determines the location of the input sequence on this line, and then assigns a certain interval for each output string, such that the width of the interval for output string $x$ is equal to $\nu(x)$. Then, if input string $p$ lies inside the interval for the output string $x$, it outputs $x$.

$T_\nu$ first calculates $\nu(0)$ and $\nu(1)$, and sets $[0, \nu(0))$ as the output interval for 0 and $[\nu(0), \nu(0) + \nu(1))$ as the output interval for 1. It then reads the input, bit by bit. After reading input $p_{1:n}$, it constructs the input interval $[0.p_1 p_2 \cdots p_n,\ 0.p_1 p_2 \cdots p_n 111 \cdots)$, which represents the interval that $0.p_1 p_2 \cdots p_n p_{n+1} \cdots$ could lie in. It then checks if this input interval is contained in one of the output intervals. If it is, then it prints output appropriate for the interval, and if not, then it reads one more bit and repeats the process.

Suppose the first output bit is a 1. Then, $T_\nu$ calculates $\nu(10)$ and $\nu(11)$, and forms the new output intervals: $[\nu(0), \nu(0) + \nu(10))$ for outputting 0, and $[\nu(0) + \nu(10), \nu(0) + \nu(10) + \nu(11))$ for outputting 1. It then reads more input bits until the input interval lies within one of these new output intervals, and then outputs the appropriate bit. The computation proceeds in this fashion.
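The decoder just described can be sketched directly. The version below uses exact rationals and takes the finitely computable $\nu$ as a Python callable; it is a schematic rendering of the proof's construction, not an implementation on an actual monotone machine.

```python
from fractions import Fraction

def decode(nu, bits: str, out_len: int) -> str:
    """Sketch of the machine T_nu from Lemma 10: read input bits,
    maintain the input interval [lo, lo + 2^-n), and emit an output bit
    whenever the input interval fits inside one of the current output
    intervals, whose widths are given by nu (returning exact Fractions)."""
    x, base = "", Fraction(0)        # output so far and its interval start
    lo, n = Fraction(0), 0           # input interval [lo, lo + 2^-n)
    while len(x) < out_len:
        w0, w1 = nu(x + "0"), nu(x + "1")
        # output intervals for x0 and x1, laid end to end from `base`
        for start, w, bit in ((base, w0, "0"), (base + w0, w1, "1")):
            if start <= lo and lo + Fraction(1, 2**n) <= start + w:
                x, base = x + bit, start     # interval fits: print the bit
                break
        else:                                # no fit: read one more input bit
            if n == len(bits):
                return x                     # input exhausted
            lo += Fraction(int(bits[n]), 2**(n + 1))
            n += 1
    return x
```

With input bits drawn from fair coin flips, the probability that decode prints a string starting with $x$ is the total width of input intervals inside the output interval for $x$, which is the content of equation (3); for instance, with nu the uniform measure $\lambda$, decode acts as the identity on its input bits.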


Equation (3) is satisfied, because $\sum_{p \to_{T_\nu} x} 2^{-|p|}$ is just the total length of all possible input intervals that fit inside the output interval for $x$, which by construction is $\nu(x)$.

To show that (4) is satisfied, note that $2^{-Km_{T_\nu}(x)}$ is the length of the largest input interval for $x$. Now, input intervals are binary intervals (that is, their start points and end points have a finite binary expansion), and for every interval $I$, there is some binary interval contained in $I$ with length 1/4 that of $I$. Therefore, the output interval for $x$ contains some input interval with length at least 1/4 that of the length of the output interval. Since the length of the output interval for $x$ is just $\nu(x)$, we can conclude that $2^{-Km_{T_\nu}(x)} \geq \nu(x)/4$.

Proof of Theorem 9. Using Lemma 10, we show a bound on $S_{Kt}$ that bounds its KL divergence with $\mu$. We then apply the unit loss bound (Hutter, 2005, Thm. 3.48) (originally shown for the Solomonoff prior, but valid for any prior) to show the desired result.

First, we reason about the running time of the shortest program that prints $x$ on the machine $T_\nu$ (defined in Lemma 10). Since we would only calculate $\nu(y0)$ and $\nu(y1)$ for $y \sqsubseteq x$, this amounts to $2|x|$ calculations. Each calculation need only take polynomial time in the length of its argument, because $T_\nu$ could just simulate the machine that takes input $y$ and returns the numerator and denominator of $\nu(y)$, prefix-free coded, and it only takes polynomial time to undo this prefix-free coding. Therefore, the calculations take at most $2|x| f(|x|) =: g(|x|)$ timesteps, where $f$ is a polynomial. We also, however, need to read all the bits of the input, construct the input intervals, and compare them to the output intervals. This takes time linear in the number of bits read, and for the shortest program that prints $x$, this number of bits is (by definition) $Km_{T_\nu}(x)$. Since $2^{-Km_{T_\nu}(x)} =^\times \nu(x)$, $Km_{T_\nu}(x) \leq -\log(\nu(x)) + O(1)$, and since $\nu(x) =^\times \mu(x)$, $-\log(\nu(x)) \leq -\log(\mu(x)) + O(1)$. Therefore, the total time taken is bounded above by $g(|x|) - O(1) \log(\mu(x))$, where we absorb the additive constants into $g(|x|)$.

This out of the way, we can calculate

$$S_{Kt}(x) = \sum_{p \to_U x} \frac{2^{-|p|}}{t(U, p, x)} \geq \sum_{\text{Turing machines } T} 2^{-|I(T)|} \sum_{q \to_T x} \frac{2^{-|q|}}{t(T, q, x)^{O(1)}} \geq^\times \sum_{p \to_{T_\nu} x} \frac{2^{-|p|}}{t(T_\nu, p, x)^{O(1)}}$$
$$\geq \frac{2^{-Km_{T_\nu}(x)}}{(g(|x|) - O(1)\log(\mu(x)))^{O(1)}} \geq^\times \frac{\mu(x)}{(g(|x|) - O(1)\log(\mu(x)))^{O(1)}} \qquad (5)$$

Now, the unit loss bound tells us that

$$L_{n\mu}^{\Lambda_{S_{Kt}}} - L_{n\mu}^{\Lambda_\mu} \leq 2 D_n(\mu \| S_{Kt}) + 2\sqrt{L_{n\mu}^{\Lambda_\mu} D_n(\mu \| S_{Kt})} \qquad (6)$$

where $D_n(\mu \| S_{Kt}) := \mathbb{E}_\mu[\ln(\mu(x_{1:n}) / S_{Kt}(x_{1:n}))]$ is the relative entropy. We can calculate $D_n(\mu \| S_{Kt})$ using equation (5):

$$D_n(\mu \| S_{Kt}) = \mathbb{E}_\mu\left[\ln \frac{\mu(x_{1:n})}{S_{Kt}(x_{1:n})}\right] \leq^\times \mathbb{E}_\mu\left[\ln\left((g(n) - O(1)\log(\mu(x_{1:n})))^{O(1)}\right)\right] \leq^\times \mathbb{E}_\mu\left[\ln(g(n) - O(1)\log(\mu(x_{1:n})))\right]$$
$$\leq \ln \mathbb{E}_\mu\left[g(n) - O(1)\log(\mu(x_{1:n}))\right] \qquad (7)$$
$$= \ln(g(n) + O(1) H_\mu(x_{1:n})) \leq \ln(g(n) + O(n)) =^\times \log n \qquad (8)$$

where (7) comes from Jensen's inequality, and $H_\mu(x_{1:n})$ denotes the binary entropy of the random variable $x_{1:n}$ with respect to $\mu$. Equations (6) and (8) together prove the theorem.

We therefore have a loss bound on the $S_{Kt}$-based sequence predictor in environments that are estimable in polynomial time by a semimeasure. Furthermore:

Corollary 11.
$$L_{n\mu}^{\Lambda_{S_{Kt}}} \leq 2 D_n(\mu \| S_{Kt}) =^\times \log n$$

for deterministic measures^5 $\mu$ computable in polynomial time, if correct predictions incur no loss.

^5 That is, measures that give probability 1 to prefixes of one particular infinite sequence.

We should note that this method fails to prove similar bounds for $S_{Fast}$, since we instead get

$$S_{Fast}(x) =^\times \sum_{p \to_U x} \frac{2^{-2|p|}}{t(U, p, x)} \geq^\times \frac{\mu(x)^2}{(|x| - O(1)\log\mu(x))^{O(1)}} \qquad (9)$$

which gives us

$$D_n(\mu \| S_{Fast}) = \mathbb{E}_\mu\left[\ln \frac{\mu(x_{1:n})}{S_{Fast}(x_{1:n})}\right] \leq O(\log n) + H_\mu(x_{1:n})$$

Since $H_\mu(x_{1:n})$ can grow linearly in $n$ (for example, take $\mu$ to be $\lambda(x) = 2^{-|x|}$, the uniform measure), this can only prove a trivial linear loss bound without restrictions on the measure $\mu$. It is also worth explicitly noting that the constants hidden in the $O(\cdot)$ notation depend on the environment $\mu$, as will be the case for the rest of this paper.
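Under 0-1 loss, minimising $\rho$-expected loss reduces to predicting the conditional mode of $\rho$. A minimal sketch of the $\Lambda_\rho$ prediction rule in that case, assuming a callable rho that evaluates the (semi)measure on finite strings (our interface, not the paper's), with ties broken arbitrarily in favour of 0:

```python
def predict_next(rho, seen: str) -> str:
    """Lambda_rho under 0-1 loss: the expected loss of predicting y is
    the total rho-probability of the other bit, so the minimiser is the
    bit with the larger conditional probability.  The conditionals
    rho(. | seen) share the denominator rho(seen), so it suffices to
    compare rho(seen + b) for b in {0, 1}."""
    return "0" if rho(seen + "0") >= rho(seen + "1") else "1"
```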

One important application of Theorem 9 is to the 0-1 loss function. Then, it states that a predictor that outputs the most likely successor bit according to $S_{Kt}$ only makes logarithmically many errors in a deterministic environment computable in polynomial time. In other words, $S_{Kt}$ quickly learns the sequence it is predicting, making very few errors.

Next, we show that $S_{Fast}$ makes only logarithmically many errors on a sequence deterministically computed in polynomial time. This follows from a rather simple argument.

Theorem 12 (Bound on $S_{Fast}$ prediction loss). Let $\mu$ be a deterministic environment and $x_{1:\infty}$ be the sequence whose prefixes $\mu$ assigns probability 1 to. If $x_{1:\infty}$ is computable in polynomial time by a program $p^x$, then $S_{Fast}$ only incurs logarithmic loss, if correct predictions incur no loss.

Proof. Using the unit loss bound (noting that the informed predictor incurs no loss in a deterministic environment),

$$L_{n\mu}^{\Lambda_{S_{Fast}}} = L_{n\mu}^{\Lambda_{S_{Fast}}} - L_{n\mu}^{\Lambda_\mu} \leq 2 D_n(\mu \| S_{Fast}) = -2 \ln S_{Fast}(x_{1:n}) \leq^\times 2\left(|p^x| + \log t(p^x, x_{1:n})\right) =^\times \log n$$

5 Time complexity

Although it has been proved that $S_{Fast}$ is computable (Schmidhuber, 2002), no bounds are given for its computational complexity. Given that the major advantage of $S_{Fast}$-based prediction over $M$-based prediction is its computability, it is of interest to determine the time required to compute $S_{Fast}$, and whether such a computation is feasible or not. The same questions apply to $S_{Kt}$, to a greater extent because we have not even yet shown that $S_{Kt}$ is computable.

In this section, we show that an arbitrarily good approximation to $S_{Fast}(x)$ is computable in time exponential in $|x|$, and an arbitrarily good approximation to $S_{Kt}(x)$ is computable in time doubly-exponential in $|x|$. We do this by explicitly constructing algorithms that perform phases of fast until enough contributions to $S_{Fast}$ or $S_{Kt}$ are found to constitute a sufficient proportion of the total.

We also show that no such approximation of $S_{Kt}$ or $S_{Fast}$ can be computed in polynomial time. We do this by contradiction: showing that if it were possible to do so, we would be able to construct an 'adversarial' sequence that was computable in polynomial time, yet could not be predicted by our approximation; a contradiction.

Finally, we investigate the time taken to compute $S_{Kt}$ and $S_{Fast}$ along a polynomial-time computable sequence $x_{1:\infty}$. If we wanted to predict the most likely continuation of $x_{1:n}$ according to $S \in \{S_{Kt}, S_{Fast}\}$, we would have to compute an approximation to $S(x_{1:n}0)$ and $S(x_{1:n}1)$, to see which one was greater. We show that it is possible to compute these approximations in polynomial time for $S_{Fast}$ and in exponential time for $S_{Kt}$: an exponential improvement over the worst-case bounds in both cases.

5.1 Upper bounds

Theorem 13 ($S_{Fast}$ computable in exponential time). For any $\varepsilon > 0$, there exists an approximation $S_{Fast}^\varepsilon$ of $S_{Fast}$ such that $|S_{Fast}^\varepsilon / S_{Fast} - 1| \leq \varepsilon$ and $S_{Fast}^\varepsilon(x)$ is computable in time exponential in $|x|$.

Proof. First, we note that in phase $i$ of fast, we try out $2^1 + \cdots + 2^i = 2^{i+1} - 2$ program prefixes $p$, and each prefix $p$ gets $2^{i-|p|}$ steps. Therefore, the total number of steps in phase $i$ is $2^1 \cdot 2^{i-1} + 2^2 \cdot 2^{i-2} + \cdots + 2^i \cdot 2^{i-i} = i 2^i$, and the total number of steps in the first $k$ phases is

$$\#\text{steps} = \sum_{i=1}^{k} i 2^i = 2^{k+1}(k - 1) + 2 \qquad (10)$$

Now, suppose we want to compute a sufficient approximation $S_{Fast}^\varepsilon(x)$. If we compute $k$ phases of fast and then add up all the contributions to $S_{Fast}(x)$ found in those phases, the remaining contributions must add up to $\leq \sum_{i=k+1}^{\infty} 2^{-i} = 2^{-k}$. In order for the contributions we have added up to constitute $\geq 1 - \varepsilon$ of the total, it suffices to use $k$ such that

$$k = \lfloor -\log(\varepsilon S_{Fast}(x)) \rfloor + 1 \qquad (11)$$

Now, since the uniform measure $\lambda(x) = 2^{-|x|}$ is finitely computable in polynomial time, it is estimable in polynomial time by itself, so we can substitute $\lambda$ into equation (9) to obtain

$$S_{Fast}(x) \geq^\times \frac{2^{-2|x|}}{(|x| + \log(2^{|x|}))^{O(1)}} = \frac{1}{2^{2|x|} |x|^{O(1)}} \qquad (12)$$

Substituting equation (12) into equation (11), we get

$$k \leq \log\left(O(2^{2|x|} |x|^{O(1)})/\varepsilon\right) + 1 = -\log\varepsilon + 2|x| + O(\log|x|) \qquad (13)$$

So, substituting equation (13) into equation (10),

$$\#\text{steps} \leq 2^{-\log\varepsilon + 2|x| + O(\log|x|) + 1}\left(-\log\varepsilon + 2|x| + O(\log|x|) - 1\right) + 2 =^\times \frac{1}{\varepsilon} 2^{2|x|} |x|^{O(1)} \left(-\log\varepsilon + 2|x| + O(\log|x|)\right) \leq \frac{2^{O(|x|)}}{\varepsilon}$$

Therefore, $S_{Fast}^\varepsilon$ is computable in exponential time.

Theorem 14 ($S_{Kt}$ computable in doubly-exponential time). For any $\varepsilon > 0$, there exists an approximation $S_{Kt}^\varepsilon$ of $S_{Kt}$ such that $|S_{Kt}^\varepsilon / S_{Kt} - 1| \leq \varepsilon$ and $S_{Kt}^\varepsilon$ is computable in time doubly-exponential in $|x|$.

Proof. Proof contained in supplementary material.

5.2 Lower bounds

Theorem 15 ($S_{Kt}$ not computable in polynomial time). For no $\varepsilon > 0$ does there exist an approximation $S_{Kt}^\varepsilon$ of $S_{Kt}$ such that $|S_{Kt}^\varepsilon / S_{Kt} - 1| \leq \varepsilon$ and $S_{Kt}^\varepsilon$ is computable in time polynomial in $|x|$.

Proof sketch. Suppose such an $S_{Kt}^\varepsilon$ existed, and define the sequence $x_{1:\infty}$ inductively by choosing $x_t$ to be the successor bit to which $S_{Kt}^\varepsilon$ assigns the smaller probability. Since $S_{Kt}^\varepsilon$ is computable in polynomial time, so is $x_{1:\infty}$, and because $S_{Kt}^\varepsilon$ approximates $S_{Kt}$ up to a factor of $1 \pm \varepsilon$, a bound of the form of Corollary 11 applies to the $\Lambda_{S_{Kt}^\varepsilon}$ predictor on $x_{1:\infty}$, so it errs only logarithmically often. However, by design, $\Lambda_{S_{Kt}^\varepsilon}$ errs every time when predicting $x_{1:\infty}$. This is a contradiction, showing that $S_{Kt}$ cannot be computable in polynomial time.

Next, we provide a proof of the analogous theorem for Schmidhuber's speed prior $S_{Fast}$, using a lemma about the rate at which $S_{Fast}$ learns polynomial-time computable deterministic sequences.

Theorem 17 ($S_{Fast}$ not computable in polynomial time). For no $\varepsilon > 0$ does there exist an approximation $S_{Fast}^\varepsilon$ of $S_{Fast}$ such that $|S_{Fast}^\varepsilon / S_{Fast} - 1| \leq \varepsilon$ and $S_{Fast}^\varepsilon(x)$ is computable in time polynomial in $|x|$.

Lemma 18. For a sequence $x_{1:\infty}$ computed in polynomial time by some program $p^x$,

$$\sum_{t=1}^{n} \left|1 - S_{Fast}(x_t \mid x_{<t})\right| \leq^\times \log n$$
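Lemma 18 supplies the learning rate that the diagonalisation argument contradicts for $S_{Fast}$, and the same device drives Theorem 15. The construction itself is easy to state as code: feed the predictor whichever bit its approximation currently considers less likely. Here s_eps is an assumed callable evaluating the polynomial-time approximation; the sketch illustrates the proof device, not either theorem's full argument.

```python
def adversarial_sequence(s_eps, n: int) -> str:
    """First n bits of a sequence on which the s_eps-based 0-1 loss
    predictor errs at every step: x_t is chosen to be the successor bit
    to which s_eps assigns the smaller value.  If s_eps runs in
    polynomial time, so does this construction."""
    x = ""
    for _ in range(n):
        # predict_next would output "0" iff s_eps(x+"0") >= s_eps(x+"1"),
        # so choose the opposite bit.
        x += "1" if s_eps(x + "0") >= s_eps(x + "1") else "0"
    return x
```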


5.3 Computability along polynomial time computable sequences

Theorem 19 ($S_{Fast}$ computable in polynomial time on polynomial time computable sequence). If $x_{1:\infty}$ is computable in polynomial time, then $S_{Fast}^\varepsilon(x_{1:n}0)$ and $S_{Fast}^\varepsilon(x_{1:n}1)$ are also computable in polynomial time.

Proof. Proof contained in supplementary material.

Theorem 20 ($S_{Kt}$ computable in exponential time on polynomial time computable sequence). If $x_{1:\infty}$ is computable in polynomial time, then $S_{Kt}^\varepsilon(x_{1:n}0)$ and $S_{Kt}^\varepsilon(x_{1:n}1)$ are computable in time $2^{n^{O(1)}}$.

Proof. Proof contained in supplementary material.

Note that Theorem 19 does not contradict Theorem 17, which merely states that there exists a sequence for which $S_{Fast}$ is not computable in polynomial time, and does not assert that $S_{Fast}$ must be computable in superpolynomial time for every sequence.

6 Discussion

In this paper, we have shown for the first time a loss bound on prediction based on a speed prior. This was proved for $S_{Kt}$, and we suspect that the result for stochastic sequences is not true for $S_{Fast}$, due to weaker bounds on its KL divergence with the true environment. However, in the special case of deterministic sequences, we show that $S_{Fast}$ has the same performance as $S_{Kt}$. We have also, again for the first time, investigated the efficiency of computing speed priors. This offers both encouraging and discouraging news: $S_{Kt}$ is good at prediction in certain environments, but is not efficiently computable, even in the restricted class of environments where it succeeds at prediction. On the other hand, $S_{Fast}$ is efficiently computable for certain inputs, and succeeds at predicting those sequences, but we have no evidence that it succeeds at prediction in the more general case of stochastic sequences.

To illustrate the appeal of speed-prior based inference, it is useful to contrast it with a similar approach introduced by Vovk (1989). This approach aims to predict certain simple measures: if $\alpha$ and $\gamma$ are functions $\mathbb{N} \to \mathbb{N}$, then a measure $\nu$ is said to be $(\alpha, \gamma)$-simple if there exists some 'program' $\pi^\nu \in \mathbb{B}^\infty$ such that the UTM with input $x$ outputs $\nu(x)$ in time $\leq \gamma(|x|)$ by reading only $\alpha(|x|)$ bits of $\pi^\nu$. Vovk proves that if $\alpha$ is logarithmic and $\gamma$ is polynomial, and if both $\alpha$ and $\gamma$ are computable in polynomial time, then there exists a measure $\mu_{\alpha,\gamma}$, itself computable in polynomial time, that predicts sequences drawn from $(\alpha, \gamma)$-simple measures.

$S_{Kt}$ and $\mu_{\alpha,\gamma}$ are similar in spirit, in that they predict measures that are easy to compute. However, the contrast between the two is instructive: $\mu_{\alpha,\gamma}$ requires one to fix $\alpha$ and $\gamma$ in advance, and only succeeds on $(\alpha, \gamma)$-simple measures. Therefore, there are many polynomials $\gamma' > \gamma$ such that $\mu_{\alpha,\gamma}$ cannot predict $(\alpha, \gamma')$-simple measures. We are therefore required to make an arbitrary choice of parameters at the start and are limited by that choice of parameters. In contrast, $S_{Kt}$ predicts all measures estimable in polynomial time, and does not require some polynomial to be fixed beforehand. $S_{Kt}$-based prediction is therefore more general than that of $\mu_{\alpha,\gamma}$.

Further questions remain to be studied. In particular, we do not know whether the loss bounds on speed-prior-based predictors can be improved. We also do not know how to tighten the gap between the lower and upper complexity bounds on the speed priors.

It would also be interesting to generalise the definition of $S_{Kt}$. Our performance result was due to the fact that for all measures $\mu$ estimable in polynomial time, $S_{Kt}(x) \geq \mu(x)/f(|x|, -\log\mu(x))$, where $f$ was a polynomial. Now, if $\mu$ is estimable in polynomial time by $\nu$, then the denominator of the fraction $\nu(x)$ must be small enough to be printed in polynomial time. This gives an exponential bound on $1/\nu(x)$, and therefore a polynomial bound on $-\log\mu(x)$. We therefore have that $S_{Kt}(x) \geq \mu(x)/g(|x|)$ for a polynomial $g$. Because $g$ is subexponential, this guarantees that $S_{Kt}$ converges to $\mu$ (Ryabko and Hutter, 2008).^6 This suggests a generalisation of $S_{Kt}$ that takes a mixture over some class of measures, each measure discounted by its computation time, as sketched below. Loss bounds can be shown in the same manner as in this paper if the measures are computable in polynomial time, but the question of the computational complexity of this mixture remains completely open.

^6 To see that $g$ must be subexponential for good predictive results, note that for all measures $\mu$, $\lambda(x) \geq \mu(x)/2^{|x|}$, but $\lambda$ does not predict well.
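One concrete way such a time-discounted mixture could be written (the class $\mathcal{M}$, the prior weights $w_\nu$, and the time $t_\nu(x)$ taken to compute $\nu(x)$ are our illustrative notation, not fixed by the discussion above) is

$$S_{\mathcal{M}}(x) := \sum_{\nu \in \mathcal{M}} w_\nu \, \frac{\nu(x)}{t_\nu(x)}$$

so that for any $\mu \in \mathcal{M}$ computable in polynomial time, $S_{\mathcal{M}}(x) \geq w_\mu \, \mu(x)/t_\mu(x) \geq \mu(x)/g(|x|)$ for a polynomial $g$, which is the shape of bound used in the proof of Theorem 9.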

Acknowledgements

The authors would like to thank the reviewers for this paper, the Machine Intelligence Research Institute for funding a workshop on Schmidhuber's speed prior, which introduced the first author to the concept, and Mayank Daswani and Tom Everitt for valuable discussions of the material. This work was in part supported by ARC grant DP150104590.

References

Rudi Cilibrasi and Paul Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.

Péter Gács. On the relation between descriptional complexity and algorithmic probability. Theoretical Computer Science, 22(1–2):71–93, 1983.

Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer Science & Business Media, 2005.

Marcus Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.

Jan Leike and Marcus Hutter. On the computability of Solomonoff induction and knowledge-seeking. In Algorithmic Learning Theory, pages 364–378. Springer, 2015.

Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Science & Business Media, 2008.

Daniil Ryabko and Marcus Hutter. Predicting non-stationary processes. Applied Mathematics Letters, 21(5):477–482, 2008.

Jürgen Schmidhuber. The speed prior: a new simplicity measure yielding near-optimal computable predictions. In Computational Learning Theory, pages 216–228. Springer, 2002.

Jürgen Schmidhuber. Optimal ordered problem solver. Machine Learning, 54(3):211–254, 2004.

Ray J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7(1):1–22, 1964.

Ray J. Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory, 24(4):422–432, 1978.

Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40(1):95–142, 2011.

Vladimir G. Vovk. Prediction of stochastic sequences. Problemy Peredachi Informatsii, 25(4):35–49, 1989.

Ian Wood, Peter Sunehag, and Marcus Hutter. (Non-)equivalence of universal priors. In Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pages 417–425. Springer, 2013.
