Topic: Foundations of Probability Theory for A.I.

Title: THE APPLICATION OF ALGORITHMIC PROBABILITY TO PROBLEMS IN ARTIFICIAL INTELLIGENCE*

Author: R. J. Solomonoff, Oxbridge Research, Box 559, Cambridge, Mass. 02238

* Much of the content of sections I and II was presented at a workshop on "Theories of Complexity", Cambridge, Mass., August, 1984.

I. INTRODUCTION

This paper covers two topics: first, an introduction to Algorithmic Complexity Theory: how it defines probability, some of its characteristic properties, and past successful applications. Second, we apply it to problems in A.I., where it promises to give near optimum search procedures for two very broad classes of problems.

The algorithmic probability of a string, s, is the probability of that particular string being produced as the output of a reference universal Turing machine with random input. It is approximately 2^-l(p), where l(p) is the length of the shortest program, p, that will produce s as output. l(p) is the length of the minimal description of s - one form of algorithmic complexity.

Algorithmic complexity has been applied to many areas of science and mathematics. It has given us a very good understanding of randomness [6]. It has been used to generalize information theory [2] and has clarified important concepts in the foundations of mathematics [1]. It has given us our first truly complete definition of probability [7,8,11].

The completeness property of algorithmic probability means that if there is any describable regularity in a body of data, our system is guaranteed to discover it using a relatively small sample of the data. It is the only probability evaluation method known to be complete. As a necessary consequence of its completeness, this kind of probability must be incomputable. Conversely, any computable probability measure cannot be complete.

Can we use this incomputable probability measure to obtain solutions to practical problems? A large step in this direction was Levin's search procedure, which obtains a solution to any P or NP problem within a constant factor of optimum time. The "constant factor" may, in general, be quite large. While this technique does not use a complete probability measure, it uses a measure that approaches completeness. Under certain reasonable conditions, and for a broader class of problems than Levin originally considered, the "constant factor" must be less than about four.

The P or NP class originally considered contained machine inversion problems: we are given a string, s, and a machine, M, that maps strings to strings. We must find, in minimum time, a string, x, such that M(x) = s. Solving algebraic equations, symbolic integration and theorem proving are examples of this broad class of problems.

However, Levin's search procedure also applies to another broad class of problems - time limited optimization problems: given a time limit T and a machine M that maps strings to real numbers, to find within time T the string x such that M(x) is as large as possible. Many engineering problems are of this sort - for example, designing an automobile in 6 months satisfying certain specifications and having minimum cost. Constructing the best possible probability distribution or physical theory from empirical data in limited time is also of this form.

In solving either machine inversion problems or time limited optimization problems, it is usually possible to express the information needed to solve the problem (either heuristic information or problem-specific information) by means of a conditional probability distribution. This distribution relates the string that describes the problem to any string that is a candidate solution to the problem. If all of the information needed to solve the problem is in this probability distribution and we do not modify the distribution during the search, then Levin's search technique is within a factor of 2 of optimum. If we are allowed to modify the probability distribution during the search, on the basis of our experience in attempting to solve the problem, then Levin's technique is within a factor of 4 of optimum.

The efficacy of this problem solving technique hinges on our ability to represent all of our relevant knowledge in a probability distribution. To what extent is this possible? For one broad area of knowledge this is certainly easy to do: this is the kind of inductive knowledge obtained from a set of examples of correctly worked problems. Algorithmic probability obtained from examples of this sort is in just the right form for application of our general problem solving system. Furthermore, when we have other kinds of information that we want to express as a probability distribution, we can usually hypothesize a sequence of examples that would lead to the acquisition of that information by a human. We can then give that set of examples to our induction system and it will acquire the same information in appropriate probabilistic form.

While it is possible to put most kinds of information into probabilistic form using this technique, a person can, with some experience, learn to bypass this process and express the desired information directly in probabilistic form. We will show how this can be done for certain kinds of heuristic information such as Planning, Analogy, Clustering and Frame Theory.

The use of a probability distribution to represent knowledge not only simplifies the solution of problems, but enables us to put information from many different kinds of problem solving systems into a common format. Then, using techniques that are fundamental to algorithmic complexity theory, we can compress this heterogeneous mass of information into a more compact, unified form. This operation corresponds to Kepler's laws summarizing and compressing Tycho Brahe's empirical data on planetary motion. Algorithmic complexity theory has this ability to synthesize, to find general laws in masses of unorganized and partially organized knowledge. It is in this area that its greatest value for A.I. lies.

I will conclude with a discussion of the present state of the system and the outstanding problems.
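Restating the definitions above in symbols (the notation is mine, added only for compactness; it repeats what the preceding paragraphs say in prose):

```latex
\begin{aligned}
&\text{Machine inversion:} && \text{given } M \text{ and } s,\ \text{find } x \text{ such that } M(x) = s,\ \text{in minimum time;}\\
&\text{Time limited optimization:} && \text{given } M \text{ and } T,\ \text{find } x \text{ maximizing } M(x) \text{ within time } T;\\
&\text{Algorithmic probability:} && P_M(s) \approx 2^{-\ell(p)},\ \text{where } p \text{ is the shortest program with } M(p) = s.
\end{aligned}
```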

II. ALGORITHMIC COMPLEXITY

The earliest application of algorithmic complexity was to devise a formal theory of inductive inference [7,8,11]. All induction problems are equivalent to the problem of extrapolating a long sequence of symbols. Formally, we can do this extrapolation by Bayes' theorem, if we are able to assign an apriori probability to any conceivable string of symbols, x.

This can be done in the following manner: Let x be a string of n binary symbols. Let M be a universal Turing machine with 3 tapes: a unidirectional input tape, a unidirectional output tape, and an infinitely long bidirectional work tape. The unidirectionality of the input and output tapes assures us that if M(s) = y, then M(ss') = yy', i.e. if s is the code of y, then if we extend s by several symbols, the output of M will be at least y and may possibly be followed by other symbols.

We assign an apriori probability to the string x by repeatedly flipping a coin and giving the machine M an input 1 whenever we have "heads" or 0 whenever we have "tails". There is some probability P_M(x) that this random binary input string will cause M to have as output a string whose first n bits are identical to x. When constructed in this manner with respect to the universal Turing machine M, P_M becomes the celebrated universal apriori probability distribution.

Conceptually, it is easy to calculate P_M. Suppose s_1, s_2, s_3, ... are all of the possible input strings to M that can produce x (at least) as output. Let s_1', s_2', s_3', ... be a maximal subset of {s_i} such that no s_i' can be formed by adding bits onto the end of some other s_j'. Thus no s_i' can be the "prefix" of any other s_j'. The probability of s_i' being produced by random coin tossing is just 2^-l(s_i'), where l(s_i') is the number of bits in s_i'. P_M(x) is then the sum of these probabilities:

$$P_M(x) = \sum_i 2^{-\ell(s_i')}$$

To do prediction with P_M(x) is very simple. The probability that x will be followed by a 1 rather than a 0 is P_M(x1)/(P_M(x0) + P_M(x1)).

Suppose P is a conditional probability distribution for the (n+1)th bit of a binary string, given the previous n bits, a_1 a_2 a_3 ... a_n. Let us further postulate that P is describable by machine M with a program b bits long. Let P_M(a_{n+1}=1 | a_1 a_2 ... a_n) be the corresponding conditional probability distribution based on P_M. Using P and a suitable source of randomness, we can generate a stochastic sequence A = a_1 a_2 a_3 ... a_n. Both P and P_M are able to assign probabilities to the occurrence of the symbol 1 at any point in the sequence A, based on the previous symbols in A. It has been shown [11, pp. 426-427] that the total expected squared error between P and P_M is given by

$$E_P\left[\sum_{m=1}^{n}\bigl(P_M(a_{m+1}=1 \mid a_1 a_2 \ldots a_m) - P(a_{m+1}=1 \mid a_1 a_2 \ldots a_m)\bigr)^2\right] < b \ln 2$$

The expected value is with respect to the probability distribution P. This means that the expected value of the sum of the squares of the deviations of P_M from P is bounded by a constant. This error is much less than that given by conventional statistics - which is proportional to ln n. The disparity is because P is describable by a finite string of symbols. Usually statistical models have parameters that have an infinite number of bits in them, and so the present analysis must be applied to them in somewhat modified form. The smallness of this error assures us that if we are given a stochastic sequence created by an unknown generator, we can use P_M to obtain the conditional probabilities of that generator with much accuracy.

Tom Cover [3; also 11, pp. 425] has shown that if P_M is made the basis of a universal gambling scheme, its yield will be extremely large.

It is clear that P_M depends on just what universal machine M is used. However, if we use a lot of data for our induction, then the probability values are relatively insensitive to the choice of M. This will be true even if we include as data information not directly related to the probabilities we are calculating. We believe that P_M gives us about the best probability values that are obtainable with the available information.

While P_M has many desirable properties, it cannot ever be used directly to obtain probability values. As a necessary consequence of its "completeness" - its ability to discover the regularities in any reasonable sample of data - P_M must be uncomputable. However, approximations to P_M are always possible, and we will work with such approximations.
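P_M itself is uncomputable, so any runnable illustration must substitute a drastically restricted reference machine. The sketch below is such a toy - the program format, the 14-bit length cutoff, and the choice of Python are my assumptions, not the paper's - but it follows the construction just described: sum 2^(-length) over a maximal prefix-free set of programs whose output begins with x, then predict the next bit by P_M(x1)/(P_M(x0)+P_M(x1)).

```python
# Toy illustration only: a tiny ad-hoc "machine" stands in for the universal
# machine, and programs are cut off at 14 bits, so these numbers are not the
# real (uncomputable) P_M.
from itertools import product

def run_toy(program: str, max_out: int = 64) -> str:
    """Toy machine: a program '1'*k + '0' + data prints data k+1 times;
    a program with no '0' separator just prints itself."""
    k = 0
    while k < len(program) and program[k] == '1':
        k += 1
    if k < len(program):                       # found the '0' separator
        return (program[k + 1:] * (k + 1))[:max_out]
    return program[:max_out]

def P_M(x: str, max_len: int = 14) -> float:
    """Sum 2^(-|p|) over a maximal prefix-free set of programs whose output
    begins with x (shorter programs take priority, extensions are skipped)."""
    total, counted = 0.0, []
    for n in range(1, max_len + 1):
        for bits in product('01', repeat=n):
            p = ''.join(bits)
            if any(p.startswith(q) for q in counted):
                continue                       # extension of a counted program
            if run_toy(p).startswith(x):
                counted.append(p)
                total += 2.0 ** (-n)
    return total

x = '01010101'
p1, p0 = P_M(x + '1'), P_M(x + '0')
print('P_M(x) =', P_M(x))
print('P(next bit is 1) =', p1 / (p0 + p1))    # prediction rule from the text
```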
Programs for the sequence x correspond to regularities in x. If x was a sequence of a million 1's, we could describe x in a few words and write a short program to generate it. If x was a random sequence with no regularities in it, this would not be possible - but we can never know that a sequence is random. All we can ever know is that we have spent a lot of time looking for regularities in it and we've not found any. However, no matter how long we have looked, we can't be sure that we wouldn't find a regularity if we looked for 10 minutes more!

Any legitimate regularity in x can be used to write a shorter code for it. This makes it possible to give a clear criterion for success to a machine that is searching for regularities in a body of data. It is an adequate basis for the mechanization of inductive inference.
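This "clear criterion for success" can be made concrete by letting an ordinary general-purpose compressor stand in (very imperfectly) for the ideal shortest code; the use of zlib below is my illustrative choice, not something the paper proposes.

```python
# Crude stand-in for "finding a shorter code": a general-purpose compressor.
# If the compressed form is shorter, a regularity has demonstrably been found;
# if it is not, that never proves the data are random (per the text above).
import os
import zlib

def found_regularity(data: bytes) -> bool:
    return len(zlib.compress(data, 9)) < len(data)

million_ones = b'1' * 1_000_000          # highly regular
random_bytes = os.urandom(1_000_000)     # no regularity a compressor can find

print(found_regularity(million_ones))    # True: a much shorter code exists
print(found_regularity(random_bytes))    # almost certainly False
```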
III. A GENERAL SYSTEM FOR SOLVING PROBLEMS

The problems solvable by the system fall in two broad classes: machine inversion problems and time limited optimization problems. In both, the problem itself, as well as the solution, can be represented by a finite string of symbols.

We will try to show that most, if not all, knowledge needed for problem solving can be expressed as a conditional probability distribution relating the problem string (condition) to the probability of various other strings being solutions. We shall be interested in probability distributions that list possible solutions with their associated probability values in decreasing order of probability.

We will use Algorithmic complexity theory to create a probability distribution of this sort. Then, considerations of Computational Complexity lead to a near optimum method to search for solutions. We will discuss the advantages of this method of knowledge representation - how it leads to a method of unifying the Babel of disparate techniques used in various existing problem solving systems.

Kinds of Problems that the System Can Solve.

Almost all problems in science and mathematics can be well approximated or expressed exactly as either machine inversion problems or time limited optimization problems.

Machine inversion problems include NP and P problems. They are problems of finding a number or other string of symbols satisfying certain specified constraints. For example, to solve x + sin x = 3, we must find a string of symbols, i.e. a number, x, that satisfies this equation. Problems of this sort can always be expressed in the form M(x) = c. Here M is a computing machine with a known program that operates on the number x. The problem is to find an x such that the output of the program is c.

Symbolic integration is another example of machine inversion. For example, we might want the indefinite integral of xe^(x^2). Suppose M is a computer program that operates on a string of symbols representing an algebraic expression and obtains a new string of symbols representing the derivative of the input string. We want a string of symbols, s, such that M(s) = xe^(x^2).

Finding proofs of theorems is also an inversion problem. Let Th be a string of symbols that represents a theorem. Let Pr be a string of symbols that represents a possible proof of theorem Th. Let M be a program that examines Th and Pr. If Pr is a legal proof of Th, then its output is "Yes", otherwise it is "No". The problem of finding a proof becomes that of finding a string s such that M(Th,s) = Yes.

There are very many other problems that can be expressed as machine inversion problems.
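As a concrete instance of an inversion problem M(x) = c, here is a minimal sketch for the equation x + sin x = 3 mentioned above. Enumerating candidate numerals from coarse to fine is my own naive stand-in for a search ordered by decreasing prior probability of the candidate strings.

```python
# Generate-and-test inversion of M(x) = x + sin(x) against the target c = 3.
import math

def M(x: float) -> float:
    return x + math.sin(x)

c, tolerance = 3.0, 1e-4

def invert(lo: float = 0.0, hi: float = 4.0) -> float:
    for digits in range(0, 8):            # short (coarse) candidates first
        step = 10.0 ** (-digits)
        x = lo
        while x <= hi:
            if abs(M(x) - c) < tolerance:
                return x
            x += step
    raise ValueError("no solution found at this precision")

root = invert()
print(root, M(root))    # M(root) is within 1e-4 of 3
```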
Another broad class of problems are time limited optimization problems. Suppose we have a known program M that operates on a number or string of symbols and produces as output a real number between zero and one. The problem is to find an input that gives the largest possible output, and we are given a fixed time, T, in which to do this. Many engineering problems are of this sort. For example, consider the problem of designing a rocketship satisfying certain specifications, having minimal cost, within the time limit of 5 years.

Another broad class of optimization problems of this sort are induction problems. An example is devising the best possible set of physical laws to explain a certain set of data and doing this within a certain restricted time. It should be noted that devising a good functional form for M - the criterion for how good the theory fits the data - is not in itself part of the optimization problem. A good functional form can, however, be obtained from algorithmic complexity theory.

The problem of extrapolating time series involves optimizing the form of the prediction function, and so it, too, can be regarded as an optimization problem.

Another form of induction is "operator induction". Here we are given an unordered sequence of ordered pairs of objects such as (1,1), (7,49), (-3,9), etc. The problem is to find a simple functional form relating the first element of each pair (the "input") to the second element (the "output"). In the example given, the optimum is easy to find, but if the functional form is not simple and noise is added to the output, the problems can be quite difficult. Some "analogy" problems on I.Q. tests are forms of operator induction.

In the most general kind of induction, the permissible form of the prediction function is very general, and it is impossible to know if any particular function is the best possible - only that it is the best found thus far. In such cases the unconstrained optimization problem is undefined, and including a time limit constraint is one useful way to give an exact definition to the problem.

All of these constrained optimization problems are of the form: given a program M, to find a string x in time T such that M(x) is maximum. In the examples given, we always knew what the program M was. However, in some cases M may be a "black box" and we are only allowed to make trial inputs and remember the resultant outputs. In other forms of the optimization problem, M may be time varying and/or have a randomly varying component. We will discuss at this time only the case in which the nature of M is known and is constant in time. Our methods of solution are, however, applicable to certain of the other cases.

In both inversion and optimization problems, the problem itself is represented as a string of symbols. For an inversion problem, M(x) = c, this will consist of the program M followed by the string, c. For optimization problems - M(x) = max in time T - our problem is represented by the program M followed by the number, T.

For inversion problems, the solution, x, will always be a string of the required form. For an optimization problem, a solution will be a program that looks at the program M, the time T, and the results of previous trials, and from these creates as output the next trial input to M. This program is always representable as a string, just as is the solution to an inversion problem.

Before telling how to solve these two broad categories of problems, I want to introduce a simple theorem in probability.

At a certain gambling house there is a set of possible bets available - all with the same big prize. The i-th possible bet has probability p_i of winning and it costs d_i dollars to make the i-th bet. All probabilities are independent and one can't make any particular bet more than once. The p_i need not be normalized.

If all the d_i are one dollar, the best bet is clearly the one of maximum p_i. If one doesn't win on that bet, try the one of next largest p_i, etc. This strategy gives the least number of expected bets before winning. If the d_i are not all the same, the best bet is that for which p_i/d_i is maximum. This gives the greatest win probability per dollar.

Theorem I: If one continues to select subsequent bets on the basis of maximum p_i/d_i, the expected total money spent before winning will be minimal.

In another context, if the cost of each bet is not dollars, but time, t_i, then the betting criterion p_i/t_i gives least expected time to win.
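A small numerical check of Theorem I (the particular bets are invented for illustration): the p_i/d_i ordering is compared against every other ordering of the same bets.

```python
# Expected total cost of betting in a given order, paid only while still losing.
from itertools import permutations

bets = [(0.30, 1.0), (0.10, 0.2), (0.50, 3.0), (0.06, 0.1)]   # (p_i, d_i)

def expected_cost(order):
    cost, prob_still_playing = 0.0, 1.0
    for p, d in order:
        cost += prob_still_playing * d      # pay d only if not yet a winner
        prob_still_playing *= (1.0 - p)
    return cost

greedy = sorted(bets, key=lambda b: b[0] / b[1], reverse=True)  # max p/d first
best = min(permutations(bets), key=expected_cost)

print(expected_cost(greedy))     # equals the minimum over all orderings
print(expected_cost(best))
```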
In order to use this theorem to solve a problem, we would like to have the functional form of our conditional probability distribution suitably tailored to that problem. If it were an inversion problem, defined by M and c, we would like a function with M and c as inputs that gave us as output a sequence of candidate strings in order of decreasing p_i/t_i. Here p_i is the probability that the candidate will solve the problem, and t_i is the time it takes to generate and test that candidate. If we had such a function, and it contained all of the information we had about solving the problem, an optimum search for the solution would simply involve testing the candidates in the order given by this distribution. Unfortunately we rarely have such a distribution available - but we can obtain something like it from which we can get good solutions. One such form has input M and c as before, but as output it has a sequence of string, probability pairs and [...] information.

Theorem II: If a correct solution to the problem is assigned a probability p_j by the distribution, and it takes time t_j to generate and test that solution, then the algorithm described will take a total search time of less than 2 t_j/p_j to find that solution.

Theorem III: If all of the information we have to solve the problem is in the probability distribution, and the only information we have about the time needed to generate and test a candidate is by experiment, then this search method is within a factor of 2 of the fastest way to solve the problem.

In 1973 L. Levin [5,13] used an algorithm much like this one to solve the same kinds of problems, but he did not postulate that all of the information needed to solve the problems was in the equivalent of the probability distribution. Lacking this strong postulate, his conclusion was weaker - i.e. that the method was within a constant factor of optimum. Though it was clear that this factor could often be very large, he conjectured that under certain circumstances it would be small. Theorem III gives one condition under which the factor is 2.

In artificial intelligence research, problem solving techniques optimum within a factor of 2 are normally regarded as much more than adequate, so a superficial reading of Theorem III might regard it as a claim to have solved most of the problems of A.I.! This would be an inaccurate interpretation.
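The source text's description of the search algorithm itself is abbreviated at this point, so the following is only a sketch of a Levin-style time-sharing procedure consistent with Theorems II and III; the candidate generator, the toy problem, and the doubling schedule are my assumptions.

```python
# Levin-style search: in each phase, give candidate i a time budget
# proportional to p_i, and double the scale from phase to phase.
import itertools

def search(candidates, is_solution):
    """candidates: list of (candidate, probability).  A solution whose
    generate-and-test time is t_j is reached once p_j * 2**phase >= t_j,
    so the total effort stays within a small constant factor of t_j/p_j."""
    for phase in itertools.count(0):
        budget_scale = 2 ** phase
        for cand, p in candidates:
            budget = p * budget_scale          # "time" allotted this phase
            if is_solution(cand, budget):
                return cand, phase

# Toy inversion problem: find a 10-bit string whose bits sum to 9.
cands = [(format(n, '010b'), 2 ** -10) for n in range(1024)]

def is_solution(bits, budget):
    if budget < 1.0:          # pretend each test needs one unit of time
        return False
    return bits.count('1') == 9

print(search(cands, is_solution))
```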
Theorem III postulates that we put all of the needed problem solving information (both general heuristic information as well as problem specific information) in the probability distribution. To do this, we usually use the problem solving techniques that are used in other kinds of problem solving systems and translate them into modifications of the probability distribution. The process is analogous to the design of an expert system by translating the knowledge of a human expert into a set of rules. While I have developed some standard techniques for doing this, the translation process is not always a simple routine. Often it gives the benefit of viewing a problem from a different perspective, yielding new, better understanding of it. Usually it is possible to simplify and improve problem solving techniques a great deal by adding probabilistic information.

When we have solved a new problem, we want to put this information into the probability distribution. This is done by a process of "compression". We start out with the probability distribution: it has been obtained by finding one or more short codes for a body of data, D. Let the single length equivalent of all of these codes be L_0. Suppose the description length of our new problem, solution pair, PS, is L_PS. "Compression" consists of finding a code for the compound object - D and PS together - that is shorter than L_0 + L_PS. This is done by finding regularities in it and expressing these regularities as short codes.

Why do we want to compress the information in our probability distribution? First: By compressing it, we find general laws in the data. These laws automatically interpolate, extrapolate and smooth the data. Second: By expressing a lot of information as a much smaller collection of [...]. Eventually the distribution will acquire enough information to be able to do useful compression in the available time without human guidance.

What are the principal advantages of expressing all of our knowledge in probabilistic form? First: We have a near optimum method of solving problems in that form. Second: It is usually not difficult to put our information into that form, and when we do so, we often find that we can improve problem solving methods considerably.

A critical question in the foregoing discussion is whether we can represent all of the kinds of information needed for problem solving in this way. I will not try to show that this can always be done, but will give some fairly general examples of how to do it.

The first example will be part of the problem of learning algebraic notation from examples. The examples are of the form:

    35, 41, + : 76
    8, 9, x : 72
    -8, 1, + : -7
    etc.

The examples all use +, -, x, and ÷ only. The problem is for the machine to induce the relationship of the string to the right of the colon to the rest of the expression. To do this it has a vocabulary of 7 symbols: R1, R2 and R3 represent the first 3 symbols of its input. Add, Sub, Mul, Div represent internal operators that can operate on the contents of R1, R2, and R3 if they are numbers. The system tries to find a short sequence of these 7 symbols that represents a program expressing the symbol to the right of the colon in terms of the other symbols.

35, 41, + : 76 can be written as 35, 41, + : R1, R2, Add. If all symbols have equal probability to start, the subsequence R1, R2, Add has probability 1/7 x 1/7 x 1/7 = 1/343. If we assume 16 bit precision, each integer has probability 2^-16 ≈ 1/65536 - so 1/343 is a great improvement over the original data. We can code the right halves of the original expressions as

    R1, R2, Add
    R1, R2, Mul
    R1, R2, Add

If there are many examples like these, it will be noted that the probability that the symbol in the first position is R1 is close to one. Similarly, the probability that the second symbol is R2 is close to one. This gives a probability of close to 1/7 for R1, R2, Add. We can increase the probability of our code further by noting that in expressions like 35, 41, + : R1, R2, Add, the last symbol is closely correlated with the third symbol - so that knowing the third symbol, we can assign very high probability to the final symbols that actually occur.

I have spoken of assigning high and low probabilities to various symbols. How does this relate to length of codes? If a symbol has probability p, we can devise a binary code for that symbol of about -log2 p bits; with suitable coding we can have an equivalent code length of exactly -log p.
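A sketch of the code-length bookkeeping in this example. The add-one smoothing used below to turn counts into probabilities is my own choice; the paper only states that the position-wise probabilities approach one.

```python
# Code lengths (-log2 p) for the right-hand codes, before and after the
# position-wise probabilities are estimated from the examples.
from math import log2
from collections import Counter

VOCAB = ['R1', 'R2', 'R3', 'Add', 'Sub', 'Mul', 'Div']

examples = [['R1', 'R2', 'Add'],      # 35, 41, + : 76
            ['R1', 'R2', 'Mul'],      # 8, 9, x  : 72
            ['R1', 'R2', 'Add']]      # -8, 1, + : -7

def position_probs(examples):
    """Per-position symbol probabilities with add-one smoothing."""
    probs = []
    for pos in range(3):
        counts = Counter(e[pos] for e in examples)
        total = len(examples) + len(VOCAB)
        probs.append({s: (counts[s] + 1) / total for s in VOCAB})
    return probs

def code_length(seq, probs):
    return sum(-log2(p[s]) for s, p in zip(seq, probs))

uniform = [{s: 1 / 7 for s in VOCAB}] * 3
learned = position_probs(examples)

new = ['R1', 'R2', 'Add']
print(code_length(new, uniform))   # 3 * log2(7) ≈ 8.42 bits (probability 1/343)
print(code_length(new, learned))   # shorter after seeing the examples
```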

The second example is a common kind of planning heuristic. When a problem is received by the system, it goes to the "planner" module. The module examines the problem and on the basis of this examination assigns 4 probabilities to it: P1, P2, P3, and P4.

P1 is the probability that the quickest solution will be obtained by first breaking the problem into subproblems ("Divide and Conquer") that must all be solved. Module M1 breaks up the problem and sends the individual subproblems back to "planner". P2 is the probability that the quickest solution is obtainable by transforming the problem into several alternative equivalent problems. Module M2 does these transformations and sends the resultant problems to "planner". P3 is the probability of solution by method M3. M3 could, for example, be a routine to solve algebraic equations. P4 is the probability of solution by method M4. M4 could, for example, be a routine to perform symbolic integration.

The operation of the system in assigning probabilities to various possible solution trials looks much like the probability assignment process in a stochastic production grammar. Because the outputs of M1 and M2 go back to "planner", we have a recursive system that can generate a search tree of infinite depth. However, the longer solution trials have much less probability, and so, when we use the optimum search algorithm, we will tend to search the high probability trials first (unless they take too long to generate and test).

The third example is an analysis of the concept "Analogy" from the viewpoint of Algorithmic probability. The best known A.I. work in this area is that of T. Evans - a program to solve geometric analogy problems such as those used in intelligence tests [14]. We will use an example of this sort.

We are given five pairs of simple geometric figures, labeled a through e (the drawings themselves appear in the original paper). Is d or e more likely to be a member of the set a, b, c?

We will devise short descriptions for the sets a, b, c, d and a, b, c, e, and show that a, b, c, d has a much shorter description. The set a, b, c, d can be described by a single operator followed by a string of operands:

    Op1 = [print the operand, then invert the operand and print it to the right].

    (1) Op1 [description of a's operand, description of b's operand, description of c's operand, description of d's operand].

To describe a, b, c, e we will also need the operator Op2:

    Op2 = [print the operand, then print it again to the right].

A short description of a, b, c, e is then:

    (2) Op1 [description of a's operand, description of b's operand, description of c's operand], Op2 [description of e's operand].

It is clear that (1) is a much shorter description than (2). We can make this analysis quantitative if we actually write out many descriptions of these two sets in some machine code. If a, b, c, d is found to have descriptions of lengths 100, 103, 103, and 105, then its total probability will be 2^-100 x (1 + 2^-3 + 2^-3 + 2^-5) ≈ 1.28125 x 2^-100. If a, b, c, e has description lengths 105, 107, 108, it will have a total probability of 2^-100 x (2^-5 + 2^-7 + 2^-8) ≈ 0.0429688 x 2^-100. The ratio of these probabilities is 29.8, so if these code lengths were correct, d would have a probability 29.8 times as great as e of being a member of the set a, b, c.

The concept of analogy is pervasive in many forms in science and mathematics. Mathematical analogies between mechanical and electrical systems make it possible to predict accurately the behavior of one by analyzing the behavior of the other. In all of these systems, the pair of things that are analogous are obtainable by a common operator such as Op1 operating on different operands. In all such cases, the kind of analysis that was used in our example can be directly applied.
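The arithmetic of the example can be checked directly by summing 2^(-length) over the alternative descriptions of each set:

```python
# Total probabilities of the two candidate sets from their description lengths.
lengths_abcd = [100, 103, 103, 105]
lengths_abce = [105, 107, 108]

p_abcd = sum(2.0 ** -l for l in lengths_abcd)
p_abce = sum(2.0 ** -l for l in lengths_abce)

print(p_abcd / 2.0 ** -100)    # 1.28125
print(p_abce / 2.0 ** -100)    # 0.04296875
print(p_abcd / p_abce)         # about 29.8
```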

The fourth example is a discussion of clustering as an inductive technique. Suppose we have a large number of objects, and the i-th object is characterized by k discrete or continuous parameters, (a_i1, a_i2, ..., a_ik). In clustering theory we ask, "Is there some natural way to break this set of objects into a bunch of 'clusters'?" [If we have] probabilistic information about the distribution of points, this can be used to define good metrics to describe the deviations of points from their respective centers.

The fifth example is a description of frame theory as a variety of clustering. Minsky's introduction to frames [15] treats them as a method of describing complex objects and storing them in memory. An example of a frame is a children's party. A children's party has many parameters that describe it: Who is giving the party? What time of day will it be given? Is it a birthday party? (i.e. Must we bring presents?) Will it be in a large room? What will we eat? What will we do? etc., etc. If we know nothing about it other than the fact that it's a party for children, each of the parameters will have a "default value" - this standard set of default parameters defines the "standard children's party". This standard can be regarded as a "center point" of a cluster in parameter space.

As we learn more about the party, we find out the true values of many of the parameters that had been given default assignments. This moves us away from the center of our cluster, with more complex descriptions of the parameters of the party. Certain of the parameters of the party can, in turn, be described as frames, having sets of default values which may or may not change as we gain more information.

IV. PRESENT STATE OF DEVELOPMENT OF THE SYSTEM

How far have we gone toward realizing this system as a computer program? Very little has been programmed. The only program specifically written for this system is one that compresses a text consisting of a long sequence of symbols. It first assigns probabilities to the symbols. Then new symbols are defined that represent short sequences appearing in the text [8, pp. 232-240], and they are also assigned probabilities.
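In the spirit of the text-compression program just described (the greedy single-pair substitution and the toy string below are my simplifications, not the program's actual method):

```python
# Assign probabilities to symbols from their frequencies, define a new symbol
# for a frequent two-symbol sequence, and re-measure the total code length.
# (The cost of stating the new symbol's definition is ignored here.)
from math import log2
from collections import Counter

def code_length(seq):
    counts = Counter(seq)
    n = len(seq)
    return sum(-log2(counts[s] / n) for s in seq)

def define_pair_symbol(seq):
    pairs = Counter(zip(seq, seq[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(a + b)                  # the newly defined symbol
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

text = list("abcabcabcabdabcabcabd")
print(code_length(text))
print(code_length(define_pair_symbol(text)))   # fewer total bits
```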

The principles of this program are very useful, since in most bodies of data the main modes of [...]. [Another application that has been examined is learning, from a set of examples, a] structure corresponding to the examples. The probabilistic form for this problem simplifies the solution considerably, so that probabilities for each possible structure can be obtained with very little calculation. The system is able to learn even if there are no negative examples - which is well beyond the capabilities of Winston's program. That probabilities are obtained rather than "best guesses" is an important improvement. This makes it possible to use the results to obtain optimum statistical decisions. "Best guesses" without probabilities are of only marginal value in statistical decision theory.

There are a few areas in which we haven't yet found very good ways to express information through probability distributions. Finding techniques to expand the expressive power of these distributions remains a direction of continued research. However, the most important present task is to write programs demonstrating the problem solving capabilities of the system in the many areas where representation of knowledge in the form of probability distributions is well understood.

References:

(1) Chaitin, G.J., "Randomness and Mathematical Proof", Scientific American, Vol. 232, No. 5, pp. 47-52, May 1975; and "Information-Theoretic Limitations of Formal Systems", Journal of the ACM, Vol. 21, pp. 403-424, July 1974.

(2) Chaitin, G.J., "A Theory of Program Size Formally Identical to Information Theory", Journal of the ACM, Vol. 22, No. 3, pp. 329-340, July 1975.

(3) Cover, T.M., "Universal Gambling Schemes and the Complexity Measures of Kolmogorov and Chaitin", Report 12, Statistics Dept., Stanford Univ., Stanford, Cal., 1974.

(4) Kolmogorov, A.N., "Three Approaches to the Quantitative Definition of Information", Problems of Information Transmission, Vol. 1, pp. 3-11, 1965; and "Logical Basis for Information Theory and Probability Theory", IEEE Transactions on Information Theory, IT-14, pp. 662-664, Sept. 1968.

(5) Levin, L.A., "Universal Search Problems", Problemy Peredachi Informatsii 9, pp. 115-116, 1973. Translated in Problems of Information Transmission 9, pp. 265-266.

(6) Martin-Löf, P., "The Definition of Random Sequences", Information and Control, Vol. 9, No. 6, pp. 602-619, Dec. 1966.

(7) Solomonoff, R.J., "A Preliminary Report on a General Theory of Inductive Inference", ZTB-138, Zator Co., Cambridge, Mass., Nov. 1960; also AFOSR TN-60-1459.

(8) Solomonoff, R.J., "A Formal Theory of Inductive Inference", Information and Control, Part I: Vol. 7, No. 1, pp. 1-22, March 1964; Part II: Vol. 7, No. 2, pp. 224-254, June 1964.

(9) Solomonoff, R.J., "Training Sequences for Mechanized Induction", in "Self Organizing Systems", M. Yovits, ed., 1962.

(10) Solomonoff, R.J., "Inductive Inference Theory - A Unified Approach to Problems in Pattern Recognition and Artificial Intelligence", 4th Int. Joint Conf. on A.I., Tbilisi, Georgia, USSR, pp. 274-280, Sept. 1975.

(11) Solomonoff, R.J., "Complexity-Based Induction Systems: Comparisons and Convergence Theorems", IEEE Trans. on Information Theory, Vol. IT-24, No. 4, pp. 422-432, July 1978.

(12) Solomonoff, R.J., "Perfect Training Sequences and the Costs of Corruption - A Progress Report on Inductive Inference Research", Oxbridge Research, Box 559, Cambridge, Mass. 02238, Aug. 1982.

(13) Solomonoff, R.J., "Optimum Sequential Search", Oxbridge Research, Box 559, Cambridge, Mass. 02238, June 1984.

(14) Evans, T., "A Heuristic Program for Solving Geometric Analogy Problems", Ph.D. Dissertation, M.I.T., Cambridge, Mass., 1963.

(15) Minsky, M., "A Framework for Representing Knowledge", in P. Winston (ed.), The Psychology of Computer Vision, McGraw-Hill, New York, 1975.