Topic: Foundations of Probability Theory for A.I.

Title: THE APPLICATION OF ALGORITHMIC PROBABILITY TO PROBLEMS IN ARTIFICIAL INTELLIGENCE*

Author: R. J. Solomonoff, Oxbridge Research, Box 559, Cambridge, Mass. 02238

* Much of the content of sections I and II was presented at a workshop on "Theories of Complexity", Cambridge, Mass., August, 1984.

I. INTRODUCTION

This paper covers two topics: first, an introduction to Algorithmic Complexity Theory: how it defines probability, some of its characteristic properties, and past successful applications. Second, we apply it to problems in A.I., where it promises to give near optimum search procedures for two very broad classes of problems.

The algorithmic probability of a string, s, is the probability of that particular string being produced as the output of a reference universal Turing machine with random input. It is approximately 2^-l(p), where l(p) is the length of the shortest program, p, that will produce s as output. l(p) is the length of the minimal description of s - one form of algorithmic complexity.

Algorithmic complexity has been applied to many areas of science and mathematics. It has given us a very good understanding of randomness [6]. It has been used to generalize information theory [2] and has clarified important concepts in the foundations of mathematics [1]. It has given us our first truly complete definition of probability [7,8,11].

The completeness property of algorithmic probability means that if there is any describable regularity in a body of data, our system is guaranteed to discover it using a relatively small sample of the data. It is the only probability evaluation method known to be complete. As a necessary consequence of its completeness, this kind of probability must be incomputable. Conversely, any computable probability measure cannot be complete.

Can we use this incomputable probability measure to obtain solutions to practical problems? A large step in this direction was Levin's search procedure, which obtains a solution to any P or NP problem within a constant factor of optimum time. The "constant factor" may, in general, be quite large. While this technique does not use a complete probability measure, it uses a measure that approaches completeness. Under certain reasonable conditions, and for a broader class of problems than Levin originally considered, the "constant factor" must be less than about four.

The P or NP class originally considered contained machine inversion problems: we are given a string, s, and a machine, M, that maps strings to strings. We must find, in minimum time, a string, x, such that M(x) = s. Solving algebraic equations, symbolic integration and theorem proving are examples of this broad class of problems.

However, Levin's search procedure also applies to another broad class of problems - time limited optimization problems: given a time limit T and a machine M that maps strings to real numbers, to find within time T the string x such that M(x) is as large as possible. Many engineering problems are of this sort - for example, designing an automobile in 6 months satisfying certain specifications and having minimum cost. Constructing the best possible probability distribution or physical theory from empirical data in limited time is also of this form.

In solving either machine inversion problems or time limited optimization problems, it is usually possible to express the information needed to solve the problem (either heuristic information or problem-specific information) by means of a conditional probability distribution. This distribution relates the string that describes the problem to any string that is a candidate solution to the problem. If all of the information needed to solve the problem is in this probability distribution and we do not modify the distribution during the search, then Levin's search technique is within a factor of 2 of optimum. If we are allowed to modify the probability distribution during the search, on the basis of our experience in attempting to solve the problem, then Levin's technique is within a factor of 4 of optimum.

The efficacy of this problem solving technique hinges on our ability to represent all of our relevant knowledge in a probability distribution. To what extent is this possible? For one broad area of knowledge this is certainly easy to do: this is the kind of inductive knowledge obtained from a set of examples of correctly worked problems. Algorithmic probability obtained from examples of this sort is in just the right form for application of our general problem solving system. Furthermore, when we have other kinds of information that we want to express as a probability distribution, we can usually hypothesize a sequence of examples that would lead to the acquisition of that information by a human. We can then give that set of examples to our induction system and it will acquire the same information in appropriate probabilistic form.

While it is possible to put most kinds of information into probabilistic form using this technique, a person can, with some experience, learn to bypass this process and express the desired information directly in probabilistic form. We will show how this can be done for certain kinds of heuristic information such as Planning, Analogy, Clustering and Frame Theory.

The use of a probability distribution to represent knowledge not only simplifies the solution of problems, but enables us to put information from many different kinds of problem solving systems into a common format. Then, using techniques that are fundamental to algorithmic complexity theory, we can compress this heterogeneous mass of information into a more compact, unified form. This operation corresponds to Kepler's laws summarizing and compressing Tycho Brahe's empirical data on planetary motion. Algorithmic complexity theory has this ability to synthesize, to find general laws in masses of unorganized and partially organized knowledge. It is in this area that its greatest value for A.I. lies.

I will conclude with a discussion of the present state of the system and the outstanding problems.
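Restating the definitions above in symbols (the notation is mine, added only for compactness; it repeats what the preceding paragraphs say in prose):

```latex
\begin{aligned}
&\text{Machine inversion:} && \text{given } M \text{ and } s,\ \text{find } x \text{ such that } M(x) = s,\ \text{in minimum time;}\\
&\text{Time limited optimization:} && \text{given } M \text{ and } T,\ \text{find } x \text{ maximizing } M(x) \text{ within time } T;\\
&\text{Algorithmic probability:} && P_M(s) \approx 2^{-\ell(p)},\ \text{where } p \text{ is the shortest program with } M(p) = s.
\end{aligned}
```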

II. ALGORITHMIC COMPLEXITY

The earliest application of algorithmic complexity was to devise a formal theory of inductive inference [7,8,11]. All induction problems are equivalent to the problem of extrapolating a long sequence of symbols. Formally, we can do this extrapolation by Bayes' theorem, if we are able to assign an apriori probability to any conceivable string of symbols, x.

This can be done in the following manner: Let x be a string of n binary symbols. Let M be a universal Turing machine with 3 tapes: a unidirectional input tape, a unidirectional output tape, and an infinitely long bidirectional work tape. The unidirectionality of the input and output tapes assures us that if M(s) = y, then M(ss') = yy', i.e. if s is the code of y, then if we extend s by several symbols, the output of M will be at least y and may possibly be followed by other symbols.

We assign an apriori probability to the string x by repeatedly flipping a coin and giving the machine M an input 1 whenever we have "heads" or 0 whenever we have "tails". There is some probability P_M(x) that this random binary input string will cause M to have as output a string whose first n bits are identical to x. When constructed in this manner with respect to the universal Turing machine M, P_M becomes the celebrated universal apriori probability distribution.

Conceptually, it is easy to calculate P_M. Suppose s_1, s_2, s_3, ... are all of the possible input strings to M that can produce x (at least) as output. Let s_1', s_2', s_3', ... be a maximal subset of {s_i} such that no s_i' can be formed by adding bits onto the end of some other s_j'. Thus no s_i' can be the "prefix" of any other s_j'. The probability of s_i' being produced by random coin tossing is just 2^-l(s_i'), where l(s_i') is the number of bits in s_i'. P_M(x) is then the sum of these probabilities:

$$P_M(x) = \sum_i 2^{-\ell(s_i')}$$

To do prediction with P_M(x) is very simple. The probability that x will be followed by a 1 rather than a 0 is P_M(x1)/(P_M(x0) + P_M(x1)).

Suppose P is a conditional probability distribution for the (n+1)th bit of a binary string, given the previous n bits, a_1 a_2 a_3 ... a_n. Let us further postulate that P is describable by machine M with a program b bits long. Let P_M(a_{n+1}=1 | a_1 a_2 ... a_n) be the corresponding conditional probability distribution based on P_M. Using P and a suitable source of randomness, we can generate a stochastic sequence A = a_1 a_2 a_3 ... a_n. Both P and P_M are able to assign probabilities to the occurrence of the symbol 1 at any point in the sequence A, based on the previous symbols in A. It has been shown [11, pp. 426-427] that the total expected squared error between P and P_M is given by

$$E_P\left[\sum_{m=1}^{n}\bigl(P_M(a_{m+1}=1 \mid a_1 a_2 \ldots a_m) - P(a_{m+1}=1 \mid a_1 a_2 \ldots a_m)\bigr)^2\right] < b \ln 2$$

The expected value is with respect to the probability distribution P. This means that the expected value of the sum of the squares of the deviations of P_M from P is bounded by a constant. This error is much less than that given by conventional statistics - which is proportional to ln n. The disparity is because P is describable by a finite string of symbols. Usually statistical models have parameters that have an infinite number of bits in them, and so the present analysis must be applied to them in somewhat modified form. The smallness of this error assures us that if we are given a stochastic sequence created by an unknown generator, we can use P_M to obtain the conditional probabilities of that generator with much accuracy.

Tom Cover [3; also 11, pp. 425] has shown that if P_M is made the basis of a universal gambling scheme, its yield will be extremely large.

It is clear that P_M depends on just what universal machine M is used. However, if we use a lot of data for our induction, then the probability values are relatively insensitive to the choice of M. This will be true even if we include as data information not directly related to the probabilities we are calculating. We believe that P_M gives us about the best probability values that are obtainable with the available information.

While P_M has many desirable properties, it cannot ever be used directly to obtain probability values. As a necessary consequence of its "completeness" - its ability to discover the regularities in any reasonable sample of data - P_M must be uncomputable. However, approximations to P_M are always possible, and we will work with such approximations.
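P_M itself is uncomputable, so any runnable illustration must substitute a drastically restricted reference machine. The sketch below is such a toy - the program format, the 14-bit length cutoff, and the choice of Python are my assumptions, not the paper's - but it follows the construction just described: sum 2^(-length) over a maximal prefix-free set of programs whose output begins with x, then predict the next bit by P_M(x1)/(P_M(x0)+P_M(x1)).

```python
# Toy illustration only: a tiny ad-hoc "machine" stands in for the universal
# machine, and programs are cut off at 14 bits, so these numbers are not the
# real (uncomputable) P_M.
from itertools import product

def run_toy(program: str, max_out: int = 64) -> str:
    """Toy machine: a program '1'*k + '0' + data prints data k+1 times;
    a program with no '0' separator just prints itself."""
    k = 0
    while k < len(program) and program[k] == '1':
        k += 1
    if k < len(program):                       # found the '0' separator
        return (program[k + 1:] * (k + 1))[:max_out]
    return program[:max_out]

def P_M(x: str, max_len: int = 14) -> float:
    """Sum 2^(-|p|) over a maximal prefix-free set of programs whose output
    begins with x (shorter programs take priority, extensions are skipped)."""
    total, counted = 0.0, []
    for n in range(1, max_len + 1):
        for bits in product('01', repeat=n):
            p = ''.join(bits)
            if any(p.startswith(q) for q in counted):
                continue                       # extension of a counted program
            if run_toy(p).startswith(x):
                counted.append(p)
                total += 2.0 ** (-n)
    return total

x = '01010101'
p1, p0 = P_M(x + '1'), P_M(x + '0')
print('P_M(x) =', P_M(x))
print('P(next bit is 1) =', p1 / (p0 + p1))    # prediction rule from the text
```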
Programs for the sequence x correspond to regularities in x. If x was a sequence of a million 1's, we could describe x in a few words and write a short program to generate it. If x was a random sequence with no regularities in it, this would not be possible - but we can never know that a sequence is random. All we can ever know is that we have spent a lot of time looking for regularities in it and we've not found any. However, no matter how long we have looked, we can't be sure that we wouldn't find a regularity if we looked for 10 minutes more!

Any legitimate regularity in x can be used to write a shorter code for it. This makes it possible to give a clear criterion for success to a machine that is searching for regularities in a body of data. It is an adequate basis for the mechanization of inductive inference.
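This "clear criterion for success" can be made concrete by letting an ordinary general-purpose compressor stand in (very imperfectly) for the ideal shortest code; the use of zlib below is my illustrative choice, not something the paper proposes.

```python
# Crude stand-in for "finding a shorter code": a general-purpose compressor.
# If the compressed form is shorter, a regularity has demonstrably been found;
# if it is not, that never proves the data are random (per the text above).
import os
import zlib

def found_regularity(data: bytes) -> bool:
    return len(zlib.compress(data, 9)) < len(data)

million_ones = b'1' * 1_000_000          # highly regular
random_bytes = os.urandom(1_000_000)     # no regularity a compressor can find

print(found_regularity(million_ones))    # True: a much shorter code exists
print(found_regularity(random_bytes))    # almost certainly False
```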
III. A GENERAL SYSTEM FOR SOLVING PROBLEMS

The problems solvable by the system fall in two broad classes: machine inversion problems and time limited optimization problems. In both, the problem itself, as well as the solution, can be represented by a finite string of symbols.

We will try to show that most, if not all, knowledge needed for problem solving can be expressed as a conditional probability distribution relating the problem string (condition) to the probability of various other strings being solutions. We shall be interested in probability distributions that list possible solutions with their associated probability values in decreasing order of probability.

We will use Algorithmic complexity theory to create a probability distribution of this sort. Then, considerations of Computational Complexity lead to a near optimum method to search for solutions. We will discuss the advantages of this method of knowledge representation - how it leads to a method of unifying the Babel of disparate techniques used in various existing problem solving systems.

Kinds of Problems that the System Can Solve.

Almost all problems in science and mathematics can be well approximated or expressed exactly as either machine inversion problems or time limited optimization problems.

Machine inversion problems include NP and P problems. They are problems of finding a number or other string of symbols satisfying certain specified constraints. For example, to solve x + sin x = 3, we must find a string of symbols, i.e. a number, x, that satisfies this equation. Problems of this sort can always be expressed in the form M(x) = c. Here M is a computing machine with a known program that operates on the number x. The problem is to find an x such that the output of the program is c.

Symbolic integration is another example of machine inversion. For example, we might want the indefinite integral of xe^(x^2). Suppose M is a computer program that operates on a string of symbols representing an algebraic expression and obtains a new string of symbols representing the derivative of the input string. We want a string of symbols, s, such that M(s) = xe^(x^2).

Finding proofs of theorems is also an inversion problem. Let Th be a string of symbols that represents a theorem. Let Pr be a string of symbols that represents a possible proof of theorem Th. Let M be a program that examines Th and Pr. If Pr is a legal proof of Th, then its output is "Yes", otherwise it is "No". The problem of finding a proof becomes that of finding a string s such that M(Th,s) = Yes.

There are very many other problems that can be expressed as machine inversion problems.
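As a concrete instance of an inversion problem M(x) = c, here is a minimal sketch for the equation x + sin x = 3 mentioned above. Enumerating candidate numerals from coarse to fine is my own naive stand-in for a search ordered by decreasing prior probability of the candidate strings.

```python
# Generate-and-test inversion of M(x) = x + sin(x) against the target c = 3.
import math

def M(x: float) -> float:
    return x + math.sin(x)

c, tolerance = 3.0, 1e-4

def invert(lo: float = 0.0, hi: float = 4.0) -> float:
    for digits in range(0, 8):            # short (coarse) candidates first
        step = 10.0 ** (-digits)
        x = lo
        while x <= hi:
            if abs(M(x) - c) < tolerance:
                return x
            x += step
    raise ValueError("no solution found at this precision")

root = invert()
print(root, M(root))    # M(root) is within 1e-4 of 3
```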
Another broad class of problems are time limited optimization problems. Suppose we have a known program M that operates on a number or string of symbols and produces as output a real number between zero and one. The problem is to find an input that gives the largest possible output, and we are given a fixed time, T, in which to do this. Many engineering problems are of this sort. For example, consider the problem of designing a rocketship satisfying certain specifications, having minimal cost, within the time limit of 5 years.

Another broad class of optimization problems of this sort are induction problems. An example is devising the best possible set of physical laws to explain a certain set of data and doing this within a certain restricted time. It should be noted that devising a good functional form for M - the criterion for how good the theory fits the data - is not in itself part of the optimization problem. A good functional form can, however, be obtained from algorithmic complexity theory.

The problem of extrapolating time series involves optimizing the form of the prediction function, and so it, too, can be regarded as an optimization problem.

Another form of induction is "operator induction". Here we are given an unordered sequence of ordered pairs of objects such as (1,1), (7,49), (-3,9), etc. The problem is to find a simple functional form relating the first element of each pair (the "input") to the second element (the "output"). In the example given, the optimum is easy to find, but if the functional form is not simple and noise is added to the output, the problems can be quite difficult. Some "analogy" problems on I.Q. tests are forms of operator induction.

In the most general kind of induction, the permissible form of the prediction function is very general, and it is impossible to know if any particular function is the best possible - only that it is the best found thus far. In such cases the unconstrained optimization problem is undefined, and including a time limit constraint is one useful way to give an exact definition to the problem.

All of these constrained optimization problems are of the form: given a program M, to find a string x in time T such that M(x) is maximum. In the examples given, we always knew what the program M was. However, in some cases M may be a "black box" and we are only allowed to make trial inputs and remember the resultant outputs. In other forms of the optimization problem, M may be time varying and/or have a randomly varying component. We will discuss at this time only the case in which the nature of M is known and is constant in time. Our methods of solution are, however, applicable to certain of the other cases.

In both inversion and optimization problems, the problem itself is represented as a string of symbols. For an inversion problem, M(x) = c, this will consist of the program M followed by the string, c. For optimization problems - M(x) = max in time T - our problem is represented by the program M followed by the number, T.

For inversion problems, the solution, x, will always be a string of the required form. For an optimization problem, a solution will be a program that looks at the program M, the time T, and the results of previous trials, and from these creates as output the next trial input to M. This program is always representable as a string, just as is the solution to an inversion problem.

Before telling how to solve these two broad categories of problems, I want to introduce a simple theorem in probability.

At a certain gambling house there is a set of possible bets available - all with the same big prize. The i-th possible bet has probability p_i of winning and it costs d_i dollars to make the i-th bet. All probabilities are independent and one can't make any particular bet more than once. The p_i need not be normalized.

If all the d_i are one dollar, the best bet is clearly the one of maximum p_i. If one doesn't win on that bet, try the one of next largest p_i, etc. This strategy gives the least number of expected bets before winning. If the d_i are not all the same, the best bet is that for which p_i/d_i is maximum. This gives the greatest win probability per dollar.

Theorem I: If one continues to select subsequent bets on the basis of maximum p_i/d_i, the expected total money spent before winning will be minimal.

In another context, if the cost of each bet is not dollars, but time, t_i, then the betting criterion p_i/t_i gives least expected time to win.
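A small numerical check of Theorem I (the particular bets are invented for illustration): the p_i/d_i ordering is compared against every other ordering of the same bets.

```python
# Expected total cost of betting in a given order, paid only while still losing.
from itertools import permutations

bets = [(0.30, 1.0), (0.10, 0.2), (0.50, 3.0), (0.06, 0.1)]   # (p_i, d_i)

def expected_cost(order):
    cost, prob_still_playing = 0.0, 1.0
    for p, d in order:
        cost += prob_still_playing * d      # pay d only if not yet a winner
        prob_still_playing *= (1.0 - p)
    return cost

greedy = sorted(bets, key=lambda b: b[0] / b[1], reverse=True)  # max p/d first
best = min(permutations(bets), key=expected_cost)

print(expected_cost(greedy))     # equals the minimum over all orderings
print(expected_cost(best))
```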
In order to use this theorem to solve a problem, we would like to have the functional form of our conditional probability distribution suitably tailored to that problem. If it were an inversion problem, defined by M and c, we would like a function with M and c as inputs that gave us as output a sequence of candidate strings in order of decreasing p_i/t_i. Here p_i is the probability that the candidate will solve the problem, and t_i is the time it takes to generate and test that candidate. If we had such a function, and it contained all of the information we had about solving the problem, an optimum search for the solution would simply involve testing the candidates in the order given by this distribution. Unfortunately we rarely have such a distribution available - but we can obtain something like it from which we can get good solutions. One such form has input M and c as before, but as output it has a sequence of string, probability pairs and [...] information.

Theorem II: If a correct solution to the problem is assigned a probability p_j by the distribution, and it takes time t_j to generate and test that solution, then the algorithm described will take a total search time of less than 2 t_j/p_j to find that solution.

Theorem III: If all of the information we have to solve the problem is in the probability distribution, and the only information we have about the time needed to generate and test a candidate is by experiment, then this search method is within a factor of 2 of the fastest way to solve the problem.

In 1973 L. Levin [5,13] used an algorithm much like this one to solve the same kinds of problems, but he did not postulate that all of the information needed to solve the problems was in the equivalent of the probability distribution. Lacking this strong postulate, his conclusion was weaker - i.e. that the method was within a constant factor of optimum. Though it was clear that this factor could often be very large, he conjectured that under certain circumstances it would be small. Theorem III gives one condition under which the factor is 2.

In artificial intelligence research, problem solving techniques optimum within a factor of 2 are normally regarded as much more than adequate, so a superficial reading of Theorem III might regard it as a claim to have solved most of the problems of A.I.! This would be an inaccurate interpretation.
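The source text's description of the search algorithm itself is abbreviated at this point, so the following is only a sketch of a Levin-style time-sharing procedure consistent with Theorems II and III; the candidate generator, the toy problem, and the doubling schedule are my assumptions.

```python
# Levin-style search: in each phase, give candidate i a time budget
# proportional to p_i, and double the scale from phase to phase.
import itertools

def search(candidates, is_solution):
    """candidates: list of (candidate, probability).  A solution whose
    generate-and-test time is t_j is reached once p_j * 2**phase >= t_j,
    so the total effort stays within a small constant factor of t_j/p_j."""
    for phase in itertools.count(0):
        budget_scale = 2 ** phase
        for cand, p in candidates:
            budget = p * budget_scale          # "time" allotted this phase
            if is_solution(cand, budget):
                return cand, phase

# Toy inversion problem: find a 10-bit string whose bits sum to 9.
cands = [(format(n, '010b'), 2 ** -10) for n in range(1024)]

def is_solution(bits, budget):
    if budget < 1.0:          # pretend each test needs one unit of time
        return False
    return bits.count('1') == 9

print(search(cands, is_solution))
```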
Theorem III postulates that we put all of the needed problem solving information (both general heuristic information as well as problem specific information) in the probability distribution. To do this, we usually use the problem solving techniques that are used in other kinds of problem solving systems and translate them into modifications of the probability distribution. The process is analogous to the design of an expert system by translating the knowledge of a human expert into a set of rules. While I have developed some standard techniques for doing this, the translation process is not always a simple routine. Often it gives the benefit of viewing a problem from a different perspective, yielding new, better understanding of it. Usually it is possible to simplify and improve problem solving techniques a great deal by adding probabilistic information.

When we have solved a new problem, we want to put this information into the probability distribution. This is done by a process of "compression". We start out with the probability distribution: it has been obtained by finding one or more short codes for a body of data, D. Let the single length equivalent of all of these codes be L_0. Suppose the description length of our new problem, solution pair, PS, is L_PS. "Compression" consists of finding a code for the compound object - D and PS together - that is shorter than L_0 + L_PS. This is done by finding regularities in it and expressing these regularities as short codes.

Why do we want to compress the information in our probability distribution? First: By compressing it, we find general laws in the data. These laws automatically interpolate, extrapolate and smooth the data. Second: By expressing a lot of information as a much smaller collection of [...]. Eventually the distribution will acquire enough information to be able to do useful compression in the available time without human guidance.

What are the principal advantages of expressing all of our knowledge in probabilistic form? First: We have a near optimum method of solving problems in that form. Second: It is usually not difficult to put our information into that form, and when we do so, we often find that we can improve problem solving methods considerably.

A critical question in the foregoing discussion is whether we can represent all of the kinds of information needed for problem solving in this way. I will not try to show that this can always be done, but will give some fairly general examples of how to do it.

The first example will be part of the problem of learning algebraic notation from examples. The examples are of the form:

    35, 41, + : 76
    8, 9, x : 72
    -8, 1, + : -7
    etc.

The examples all use +, -, x, and ÷ only. The problem is for the machine to induce the relationship of the string to the right of the colon to the rest of the expression. To do this it has a vocabulary of 7 symbols: R1, R2 and R3 represent the first 3 symbols of its input. Add, Sub, Mul, Div represent internal operators that can operate on the contents of R1, R2, and R3 if they are numbers. The system tries to find a short sequence of these 7 symbols that represents a program expressing the symbol to the right of the colon in terms of the other symbols.

35, 41, + : 76 can be written as 35, 41, + : R1, R2, Add. If all symbols have equal probability to start, the subsequence R1, R2, Add has probability 1/7 x 1/7 x 1/7 = 1/343. If we assume 16 bit precision, each integer has probability 2^-16 ≈ 1/65536 - so 1/343 is a great improvement over the original data. We can code the right halves of the original expressions as

    R1, R2, Add
    R1, R2, Mul
    R1, R2, Add

If there are many examples like these, it will be noted that the probability that the symbol in the first position is R1 is close to one. Similarly, the probability that the second symbol is R2 is close to one. This gives a probability of close to 1/7 for R1, R2, Add. We can increase the probability of our code further by noting that in expressions like 35, 41, + : R1, R2, Add, the last symbol is closely correlated with the third symbol - so that knowing the third symbol, we can assign very high probability to the final symbols that actually occur.

I have spoken of assigning high and low probabilities to various symbols. How does this relate to length of codes? If a symbol has probability p, we can devise a binary code for that symbol of about -log2 p bits; with suitable coding we can have an equivalent code length of exactly -log p.
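A sketch of the code-length bookkeeping in this example. The add-one smoothing used below to turn counts into probabilities is my own choice; the paper only states that the position-wise probabilities approach one.

```python
# Code lengths (-log2 p) for the right-hand codes, before and after the
# position-wise probabilities are estimated from the examples.
from math import log2
from collections import Counter

VOCAB = ['R1', 'R2', 'R3', 'Add', 'Sub', 'Mul', 'Div']

examples = [['R1', 'R2', 'Add'],      # 35, 41, + : 76
            ['R1', 'R2', 'Mul'],      # 8, 9, x  : 72
            ['R1', 'R2', 'Add']]      # -8, 1, + : -7

def position_probs(examples):
    """Per-position symbol probabilities with add-one smoothing."""
    probs = []
    for pos in range(3):
        counts = Counter(e[pos] for e in examples)
        total = len(examples) + len(VOCAB)
        probs.append({s: (counts[s] + 1) / total for s in VOCAB})
    return probs

def code_length(seq, probs):
    return sum(-log2(p[s]) for s, p in zip(seq, probs))

uniform = [{s: 1 / 7 for s in VOCAB}] * 3
learned = position_probs(examples)

new = ['R1', 'R2', 'Add']
print(code_length(new, uniform))   # 3 * log2(7) ≈ 8.42 bits (probability 1/343)
print(code_length(new, learned))   # shorter after seeing the examples
```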

The second example is a common kind of planning heuristic. When a problem is received by the system, it goes to the "planner" module. The module examines the problem and on the basis of this examination assigns 4 probabilities to it: P1, P2, P3, and P4.

P1 is the probability that the quickest solution will be obtained by first breaking the problem into subproblems ("Divide and Conquer") that must all be solved. Module M1 breaks up the problem and sends the individual subproblems back to "planner". P2 is the probability that the quickest solution is obtainable by transforming the problem into several alternative equivalent problems. Module M2 does these transformations and sends the resultant problems to "planner". P3 is the probability of solution by method M3. M3 could, for example, be a routine to solve algebraic equations. P4 is the probability of solution by method M4. M4 could, for example, be a routine to perform symbolic integration.

The operation of the system in assigning probabilities to various possible solution trials looks much like the probability assignment process in a stochastic production grammar. Because the outputs of M1 and M2 go back to "planner", we have a recursive system that can generate a search tree of infinite depth. However, the longer solution trials have much less probability, and so, when we use the optimum search algorithm, we will tend to search the high probability trials first (unless they take too long to generate and test).

The third example is an analysis of the concept "Analogy" from the viewpoint of Algorithmic probability. The best known A.I. work in this area is that of T. Evans - a program to solve geometric analogy problems such as those used in intelligence tests [14]. We will use an example of this sort.

We are given five pairs of simple geometric figures, labeled a through e (the drawings themselves appear in the original paper). Is d or e more likely to be a member of the set a, b, c?

We will devise short descriptions for the sets a, b, c, d and a, b, c, e, and show that a, b, c, d has a much shorter description. The set a, b, c, d can be described by a single operator followed by a string of operands:

    Op1 = [print the operand, then invert the operand and print it to the right].

    (1) Op1 [description of a's operand, description of b's operand, description of c's operand, description of d's operand].

To describe a, b, c, e we will also need the operator Op2:

    Op2 = [print the operand, then print it again to the right].

A short description of a, b, c, e is then:

    (2) Op1 [description of a's operand, description of b's operand, description of c's operand], Op2 [description of e's operand].

It is clear that (1) is a much shorter description than (2). We can make this analysis quantitative if we actually write out many descriptions of these two sets in some machine code. If a, b, c, d is found to have descriptions of lengths 100, 103, 103, and 105, then its total probability will be 2^-100 x (1 + 2^-3 + 2^-3 + 2^-5) ≈ 1.28125 x 2^-100. If a, b, c, e has description lengths 105, 107, 108, it will have a total probability of 2^-100 x (2^-5 + 2^-7 + 2^-8) ≈ 0.0429688 x 2^-100. The ratio of these probabilities is 29.8, so if these code lengths were correct, d would have a probability 29.8 times as great as e of being a member of the set a, b, c.

The concept of analogy is pervasive in many forms in science and mathematics. Mathematical analogies between mechanical and electrical systems make it possible to predict accurately the behavior of one by analyzing the behavior of the other. In all of these systems, the pair of things that are analogous are obtainable by a common operator such as Op1 operating on different operands. In all such cases, the kind of analysis that was used in our example can be directly applied.
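The arithmetic of the example can be checked directly by summing 2^(-length) over the alternative descriptions of each set:

```python
# Total probabilities of the two candidate sets from their description lengths.
lengths_abcd = [100, 103, 103, 105]
lengths_abce = [105, 107, 108]

p_abcd = sum(2.0 ** -l for l in lengths_abcd)
p_abce = sum(2.0 ** -l for l in lengths_abce)

print(p_abcd / 2.0 ** -100)    # 1.28125
print(p_abce / 2.0 ** -100)    # 0.04296875
print(p_abcd / p_abce)         # about 29.8
```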

The fourth example is a discussion of clustering as an inductive technique. Suppose we have a large number of objects, and the i-th object is characterized by k discrete or continuous parameters, (a_i1, a_i2, ..., a_ik). In clustering theory we ask, "Is there some natural way to break this set of objects into a bunch of 'clusters'?" [If we have] probabilistic information about the distribution of points, this can be used to define good metrics to describe the deviations of points from their respective centers.

The fifth example is a description of frame theory as a variety of clustering. Minsky's introduction to frames [15] treats them as a method of describing complex objects and storing them in memory. An example of a frame is a children's party. A children's party has many parameters that describe it: Who is giving the party? What time of day will it be given? Is it a birthday party? (i.e. Must we bring presents?) Will it be in a large room? What will we eat? What will we do? etc., etc. If we know nothing about it other than the fact that it's a party for children, each of the parameters will have a "default value" - this standard set of default parameters defines the "standard children's party". This standard can be regarded as a "center point" of a cluster in parameter space.

As we learn more about the party, we find out the true values of many of the parameters that had been given default assignments. This moves us away from the center of our cluster, with more complex descriptions of the parameters of the party. Certain of the parameters of the party can, in turn, be described as frames, having sets of default values which may or may not change as we gain more information.

IV. PRESENT STATE OF DEVELOPMENT OF THE SYSTEM

How far have we gone toward realizing this system as a computer program? Very little has been programmed. The only program specifically written for this system is one that compresses a text consisting of a long sequence of symbols. It first assigns probabilities to the symbols. Then new symbols are defined that represent short sequences appearing in the text [8, pp. 232-240], and they are also assigned probabilities.
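In the spirit of the text-compression program just described (the greedy single-pair substitution and the toy string below are my simplifications, not the program's actual method):

```python
# Assign probabilities to symbols from their frequencies, define a new symbol
# for a frequent two-symbol sequence, and re-measure the total code length.
# (The cost of stating the new symbol's definition is ignored here.)
from math import log2
from collections import Counter

def code_length(seq):
    counts = Counter(seq)
    n = len(seq)
    return sum(-log2(counts[s] / n) for s in seq)

def define_pair_symbol(seq):
    pairs = Counter(zip(seq, seq[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(a + b)                  # the newly defined symbol
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

text = list("abcabcabcabdabcabcabd")
print(code_length(text))
print(code_length(define_pair_symbol(text)))   # fewer total bits
```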

The principles of this program are very useful, since in most bodies of data the main modes of [...]. [Another application that has been examined is learning, from a set of examples, a] structure corresponding to the examples. The probabilistic form for this problem simplifies the solution considerably, so that probabilities for each possible structure can be obtained with very little calculation. The system is able to learn even if there are no negative examples - which is well beyond the capabilities of Winston's program. That probabilities are obtained rather than "best guesses" is an important improvement. This makes it possible to use the results to obtain optimum statistical decisions. "Best guesses" without probabilities are of only marginal value in statistical decision theory.

There are a few areas in which we haven't yet found very good ways to express information through probability distributions. Finding techniques to expand the expressive power of these distributions remains a direction of continued research. However, the most important present task is to write programs demonstrating the problem solving capabilities of the system in the many areas where representation of knowledge in the form of probability distributions is well understood.

References:

(1) Chaitin, G.J., "Randomness and Mathematical Proof", Scientific American, Vol. 232, No. 5, pp. 47-52, May 1975; and "Information-Theoretic Limitations of Formal Systems", Journal of the ACM, Vol. 21, pp. 403-424, July 1974.

(2) Chaitin, G.J., "A Theory of Program Size Formally Identical to Information Theory", Journal of the ACM, Vol. 22, No. 3, pp. 329-340, July 1975.

(3) Cover, T.M., "Universal Gambling Schemes and the Complexity Measures of Kolmogorov and Chaitin", Report 12, Statistics Dept., Stanford Univ., Stanford, Cal., 1974.

(4) Kolmogorov, A.N., "Three Approaches to the Quantitative Definition of Information", Problems of Information Transmission, Vol. 1, pp. 3-11, 1965; and "Logical Basis for Information Theory and Probability Theory", IEEE Transactions on Information Theory, IT-14, pp. 662-664, Sept. 1968.

(5) Levin, L.A., "Universal Search Problems", Problemy Peredachi Informatsii 9, pp. 115-116, 1973. Translated in Problems of Information Transmission 9, pp. 265-266.

(6) Martin-Löf, P., "The Definition of Random Sequences", Information and Control, Vol. 9, No. 6, pp. 602-619, Dec. 1966.

(7) Solomonoff, R.J., "A Preliminary Report on a General Theory of Inductive Inference", ZTB-138, Zator Co., Cambridge, Mass., Nov. 1960; also AFOSR TN-60-1459.

(8) Solomonoff, R.J., "A Formal Theory of Inductive Inference", Information and Control, Part I: Vol. 7, No. 1, pp. 1-22, March 1964; Part II: Vol. 7, No. 2, pp. 224-254, June 1964.

(9) Solomonoff, R.J., "Training Sequences for Mechanized Induction", in "Self Organizing Systems", M. Yovits, ed., 1962.

(10) Solomonoff, R.J., "Inductive Inference Theory - A Unified Approach to Problems in Pattern Recognition and Artificial Intelligence", 4th Int. Joint Conf. on A.I., Tbilisi, Georgia, USSR, pp. 274-280, Sept. 1975.

(11) Solomonoff, R.J., "Complexity-Based Induction Systems: Comparisons and Convergence Theorems", IEEE Trans. on Information Theory, Vol. IT-24, No. 4, pp. 422-432, July 1978.

(12) Solomonoff, R.J., "Perfect Training Sequences and the Costs of Corruption - A Progress Report on Inductive Inference Research", Oxbridge Research, Box 559, Cambridge, Mass. 02238, Aug. 1982.

(13) Solomonoff, R.J., "Optimum Sequential Search", Oxbridge Research, Box 559, Cambridge, Mass. 02238, June 1984.

(14) Evans, T., "A Heuristic Program for Solving Geometric Analogy Problems", Ph.D. Dissertation, M.I.T., Cambridge, Mass., 1963.

(15) Minsky, M., "A Framework for Representing Knowledge", in P. Winston (ed.), The Psychology of Computer Vision, McGraw-Hill, New York, 1975.