
Algorithmic Probability, Heuristic Programming and AGI

Ray J. Solomonoff
Visiting Professor, Computer Research Center, Royal Holloway, University of London
IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland
[email protected]  http://world.std.com/~rjs/pubs.html

Introduction

This paper is about Algorithmic Probability (ALP) and Heuristic Programming and how they can be combined to achieve AGI. It is an update of a 2003 report describing a system of this kind (Sol03). We first describe ALP, giving the most common implementation of it, and then the features of ALP relevant to its application to AGI. They are: Completeness, Incomputability, Subjectivity and Diversity. We then show how these features enable us to create a very general, very intelligent problem solving machine. For this we will devise "Training Sequences" — sequences of problems designed to put problem-solving information into the machine. We describe a few kinds of training sequences.

The problems are solved by a "generate and test" algorithm, in which the candidate solutions are selected through a "Guiding Probability Distribution". The use of Levin's search procedure enables us to efficiently consider the full class of partial recursive functions as possible solutions to our problems. The guiding probability distribution is updated after each problem is solved, so that the next problem can profit from things learned in the previously solved problems.

We describe a few updating techniques. Improvements in updating based on heuristic programming are one of the principal directions of current research. Designing training sequences is another important direction.

For some of the simpler updating algorithms, it is easy to "merge" the guiding probabilities of machines that have been educated using different training sequences — resulting in a machine that is more intelligent than any of the component machines.

What is Algorithmic Probability?

ALP is a technique for the extrapolation of a sequence of binary symbols — all induction problems can be put into this form. We first assign a probability to any finite binary sequence. We can then use Bayes' theorem to compute the probability of any particular continuation sequence. The big problem is: how do we assign these probabilities to strings? In one of the commonest implementations of ALP, we have a Universal Turing Machine with three tapes: a unidirectional input tape, a unidirectional output tape, and a bidirectional work tape. If we feed it an input tape with 0's and 1's on it, the machine may print some 0's and 1's on the output — it could print nothing at all, print a finite string and stop, print an infinite output string, or go into an infinite computing loop with no printing at all.

Suppose we want to find the ALP of a finite string x. We feed random bits into the machine. There is a certain probability that the output will be a string that starts out with the string x. That is the ALP of string x. To compute the ALP of string x:

    P_M(x) = Σ_{i=0}^{∞} 2^(−|S_i(x)|)

Here P_M(x) is the ALP (also called Universal Probability) of string x with respect to machine M. There are many finite string inputs to M that will give an output that begins with x. We call such strings "codes for x". Most of these codes are redundant, in the sense that if one removes the last bit the resultant string will still be a "code for x". A "minimal code for x" is one that is not redundant: if one removes its last bit, the result will no longer be a "code for x". Say |S_i(x)| is the length in bits of the i-th minimal code for x. Then 2^(−|S_i(x)|) is the probability that the random input will begin with the i-th minimal code for x, and P_M(x) is the sum of the probabilities of all the ways that a string beginning with x could be generated.
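As a concrete illustration of the definition, here is a toy Python sketch. The "reference machine" below is a tiny, non-universal stand-in invented for this example; for a genuinely universal M the sum is incomputable, so an enumeration like this can only ever give a lower bound on P_M(x).

```python
from itertools import product

def toy_machine(program: str):
    """A hypothetical stand-in for the reference machine M (not universal):
    the program 1^k 0 outputs the string '01' repeated k times;
    every other program produces no output."""
    k = program.find('0')
    if k == -1 or program != '1' * k + '0':
        return None
    return '01' * k

def is_code_for(x: str, program: str) -> bool:
    out = toy_machine(program)
    return out is not None and out.startswith(x)

def alp_lower_bound(x: str, max_len: int = 14) -> float:
    """Approximate P_M(x) = sum over minimal codes S_i(x) of 2^(-|S_i(x)|)
    by enumerating all programs up to max_len bits.  Truncated enumeration
    can only under-estimate the true value."""
    total = 0.0
    for n in range(1, max_len + 1):
        for bits in product('01', repeat=n):
            p = ''.join(bits)
            # minimal code: a code for x whose last bit cannot be removed
            if is_code_for(x, p) and not is_code_for(x, p[:-1]):
                total += 2.0 ** (-len(p))
    return total

print(alp_lower_bound('0101'))   # ~0.25 on this toy machine
```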
This definition has some interesting properties:

First, it assigns high probabilities to strings with short descriptions — this is in the spirit of Ockham's razor. It is the converse of Huffman coding, which assigns short codes to high probability symbols.

Second, its value is somewhat independent of what universal machine is used, because codes for one universal machine can always be obtained from another universal machine by the addition of a finite sequence of translation instructions.

A less apparent but clearly desirable property: P_M(x) is complete. This means that if there is any describable regularity in a batch of data, P_M will find it, using a relatively small amount of the data. At this time, it is the only induction method known to be complete (Sol78).

More exactly: Suppose µ(x) is a probability distribution on finite binary strings. For each x = x_1, x_2 ··· x_i, µ gives a probability that the next bit, x_{i+1}, will be 1: µ(x_{i+1} = 1 | x_1, x_2 ··· x_i). From P_M we can obtain a similar function P(x_{i+1} = 1 | x_1, x_2 ··· x_i). Suppose we use µ to generate a sequence, x, Monte Carlo-wise. µ will assign a probability to the (i+1)-th bit based on all previous bits. Similarly, P will assign a probability to the (i+1)-th bit of x. If P_M is a very good predictor, the probability values obtained from µ and from P_M will be very close, on the average, for long sequences. What I proved was:

    E_µ [ Σ_{i=1}^{n} ( µ(x_{i+1} = 1 | x_1, x_2 ··· x_i) − P(x_{i+1} = 1 | x_1, x_2 ··· x_i) )² ] ≤ (1/2) k ln 2

The expected value of the sum of the squares of the differences between the probabilities is bounded by about 0.35k. Here k is the minimum number of bits that M, the reference machine, needs to describe µ. If the function µ is describable by functions that are close to M's primitive instruction set, then k will be small and the error will be small. But whether large or small, the squared error in probability must converge faster than 1/n (because Σ 1/n diverges).

Later research has shown this result to be very robust — we can use a large (non-binary) alphabet and/or use error functions that are different from total square difference (Hut02). The probability obtained can be normalized or unnormalized (a semi-measure) (Gác97). The function µ to be "discovered" can be any describable function — primitive recursive, total recursive, or partial recursive. When ALP uses an unnormalized semi-measure, it can discover incomputable functions as well.

The desirable aspects of ALP are quite clear. We know of no other model of induction that is nearly as good.
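The theorem concerns the incomputable P_M, but the behaviour it asserts (cumulative squared prediction error growing far more slowly than n) is easy to observe empirically with a computable stand-in. In the sketch below, a Bernoulli source plays the role of µ and Laplace's rule stands in for the predictor P; both choices are illustrative assumptions, not part of the paper.

```python
import random

def laplace_estimate(ones: int, total: int) -> float:
    # Laplace's rule of succession: (count of 1s + 1) / (total + 2)
    return (ones + 1) / (total + 2)

def cumulative_squared_error(n: int, p_true: float = 0.7, seed: int = 0) -> float:
    """Generate x_1..x_n from a Bernoulli(p_true) source mu and accumulate
    sum_i (mu(next bit = 1) - estimate(next bit = 1))^2, with Laplace's rule
    standing in for the (incomputable) universal predictor."""
    rng = random.Random(seed)
    ones = 0
    err = 0.0
    for i in range(n):
        p_hat = laplace_estimate(ones, i)
        err += (p_true - p_hat) ** 2
        bit = 1 if rng.random() < p_true else 0
        ones += bit
    return err

for n in (100, 1000, 10000):
    print(n, round(cumulative_squared_error(n), 3))  # grows far more slowly than n
```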
An apparent difficulty: P_M(x) is incomputable. The equation defining P_M(x) tells us to find all strings that are "minimal codes for x". Because of the Halting Problem, it is impossible to tell whether certain strings are codes for x or not. While it is easy to make approximations to P_M(x), the fact that it is incomputable has given rise to the common misconception that ALP is little more than an interesting theoretical model with no direct practical application. We will show that, in fact, incomputability is a desirable feature and imposes no serious restrictions on its application to the practical problems of AGI.

The usual question is — "What good is it if you can't compute it?" The answer is that for practical prediction we don't have to compute ALP exactly. Approximations to it are quite usable, and the closer an approximation is to ALP, the more likely it is to share ALP's desirable qualities.

Perhaps the simplest kind of approximation to an incomputable number involves making rational approximations to √2. We know that there is no rational number whose square is 2, but we can get arbitrarily close approximations. We can also compute an upper bound on the error of our approximation, and for most methods of successive approximation we are assured that the errors approach zero. In the case of ALP, though we are assured that the approximations will approach ALP arbitrarily closely, the incomputability implies that we cannot ever compute useful upper bounds on approximation error — but for few if any practical applications do we need this information.

The approximation problem for the universal distribution is very similar to that of approximating a solution to the Traveling Salesman Problem when the number of cities is too large to enable an exact solution. When we make trial paths, we always know the total length of each path — so we know whether one trial is better than another. In approximations for the universal distribution, we also always know when one approximation is better than another — and we know how much better. In some cases, we can combine trials to obtain a trial that is better than either of the component trials. In both TSP and ALP approximation, we never know how far we are from the theoretically best, yet in both cases we do not hesitate to use approximate solutions to our problems.
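The point that we always know which of two approximations to P_M(x) is better, and by how much, amounts to comparing partial sums over the codes found so far; a trivial sketch (the code lengths are invented):

```python
def approx_alp(code_lengths):
    """An approximation to P_M(x): the partial sum of 2^(-length) over the
    codes for x found so far.  More codes can only increase the value."""
    return sum(2.0 ** (-l) for l in code_lengths)

trial_a = approx_alp([5, 7, 9])       # codes found by one search (hypothetical lengths)
trial_b = approx_alp([5, 7, 9, 10])   # a second search that found one more code
print(trial_b >= trial_a, trial_b - trial_a)   # which is better, and by how much
```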
The incomputability of ALP is closely associated with its completeness. Any complete induction system cannot be computable. Conversely, any computable induction system cannot be complete. For any computable induction system, it is possible to construct a space of data sequences for which that system gives extremely poor probability values; the sum of the squared errors diverges linearly in the sequence length. Appendix B gives a simple construction of this kind. We note that the incomputability of ALP makes such a construction impossible, and its probability error always converges to zero for any finitely describable sequence.

To explain our earlier remark on incomputability as a very desirable feature: incomputability is the only way we can achieve completeness. In ALP this incomputability imposes no penalty on its practical application. It is a true "win, win" situation!

Another item of importance: for most applications an estimate of future prediction error is needed. Cross Validation or one of its many variants is usually possible. In this aspect of the prediction problem ALP is certainly no worse than any other method. On the other hand, ALP gives a good theoretical framework that enables us to make better estimates.

Subjectivity

Occasionally, in making extrapolations of a batch of data, there is enough known about the data so that it is clear that a certain prediction technique is optimal. However, this is often not the case, and the investigator must make a (subjective) choice of inductive techniques. For me, the choice is clear: I choose ALP because it is the only complete induction method I know of. However, ALP has another subjective aspect as well: we have to choose M, the reference machine or language. As long as M is universal, the system will be complete. This choice of M enables us to insert into the system any a priori information we may have about the data to be predicted and still retain completeness.

The choice of M can be used very effectively for incremental learning. Suppose we have 100 induction problems: X_1, X_2, ··· X_100. The best solution would involve getting the machine to find a short code for the entire batch of 100 problems. For a large corpus this can be a lengthy task. A shorter, approximate way: using the machine M as reference, we find a prediction code for X_1. In view of this code, we modify M to become M' in such a way that if P_M' makes a prediction for X_2, it will be the same as if we had used P_M for both X_1 and X_2. M' becomes a complete summary of M's increase in knowledge after solving X_1. We can consider the M to M' transformation as an updating of M. It is possible to show that such an M' can be found exactly, but the exact construction is very time consuming (reference). Approximations to M' can be readily found. After M' solves X_2, M' can be updated to M'' and have it solve X_3, and so on to X_100. We will discuss the update process later in more detail, for a somewhat different kind of problem.

To understand the role of subjectivity in the life of a human or an intelligent machine, let us consider the human infant. It is born with certain capabilities that assume certain a priori characteristics of its environment-to-be. It expects to breathe air, its immune system is designed for certain kinds of challenges, and it is usually able to learn to walk and converse in whatever human language it finds in its early environment. As it matures, its a priori information is modified and augmented by its experience.

The inductive system we are working on is of this sort. Each time it solves a problem, or is unsuccessful in solving a problem, it updates the part of its a priori information that is relevant to problem solving techniques. In a manner very similar to that of a maturing human being, its a priori information grows as the life experience of the system grows.

From the foregoing, it is clear that the subjectivity of algorithmic probability is a necessary feature that enables an intelligent system to incorporate experience of the past into techniques for solving problems of the future.
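A minimal sketch of the M to M' update, under the simplifying assumption (mine, not the paper's construction) that the reference machine is summarized by a weight vector over a fixed set of candidate predictors: conditioning those weights on the solved problem X_1 yields an M' whose prediction for X_2 matches what M would give after seeing X_1.

```python
def update_reference(weights, likelihoods):
    """weights: prior weights of candidate models under M.
    likelihoods: probability each model assigned to the solved problem X_1.
    The returned weights define M'; for this mixture, predicting X_2 with M'
    matches predicting X_2 with M after M has seen X_1."""
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]

prior = [0.5, 0.3, 0.2]          # hypothetical a priori weights under M
lik_X1 = [0.01, 0.20, 0.05]      # hypothetical P(X_1 | model)
print(update_reference(prior, lik_X1))
```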
Diversity

In Section 1 we described ALP based on a universal Turing machine with random input. An equivalent model considers all prediction methods, and makes a prediction based on the weighted sum of all of these predictors. The weight of each predictor is the product of two factors: the first is the a priori probability of the predictor — the probability that this predictor would be described by a universal Turing machine with random input. If the predictor is described by a small number of bits, it will be given high a priori probability. The second factor is the probability assigned by the predictor to the data of the past that is being used for prediction.

We may regard each prediction method as a kind of model or explanation of the data. Many people would use only the best model or explanation and throw away the rest. Minimum Description Length (Ris78) and Minimum Message Length (WB68) are two commonly used approximations to ALP that use only the best model of this sort. When one model is much better than any of the others, then Minimum Description Length, Minimum Message Length and ALP give about the same predictions. If many of the top models have about the same weight, then ALP gives better results — the other methods give too much confidence in the predictions of the very best model.

However, that's not the main advantage of ALP's use of a diversity of explanations. If we are making a single kind of prediction, then discarding the non-optimum models usually has a small penalty associated with it. However, if we are working on a sequence of prediction problems, we will often find that the model that worked best in the past is inadequate for the new problems. When this occurs in science we have to revise our old theories. A good scientist will remember many theories that worked in the past but were discarded — either because they didn't agree with the new data, or because they were a priori "unlikely". New theories are characteristically devised by using failed models of the past, taking them apart, and using the parts to create new candidate theories. By having a large, diverse set of (non-optimum) models on hand to create new trial models, ALP is in the best possible position to create new, effective models for prediction.

In the biological world, when a species loses its genetic diversity it can quickly succumb to small environmental changes — it soon becomes extinct. Lack of diversity in potato varieties in Ireland led to massive starvation. When ALP is used in Genetic Programming, its rich diversity of models can be expected to lead to very good, very fast solutions with little likelihood of "premature convergence".
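A small numerical sketch of the weighted sum of predictors described at the start of this section, contrasted with keeping only the single best model in the MDL/MML style. All numbers are invented for illustration.

```python
# Each predictor: (description length in bits, prob. assigned to past data,
#                  its probability that the next bit is 1).
predictors = [
    (10, 1e-3, 0.9),
    (12, 8e-4, 0.2),
    (11, 9e-4, 0.8),
]

# Weight of a predictor = a priori probability (2^-bits) times its likelihood.
weights = [2.0 ** (-k) * p_data for k, p_data, _ in predictors]
total = sum(weights)

alp_mix = sum(w / total * p1 for w, (_, _, p1) in zip(weights, predictors))
best = max(range(len(predictors)), key=lambda i: weights[i])
single_best = predictors[best][2]

print("mixture prediction:", round(alp_mix, 3))
print("best-model-only prediction:", round(single_best, 3))
```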
Putting It All Together

I have described ALP and some of its properties, and, to some extent, how it could be used in an AGI system. This section gives more details on how it works.

We start out with problems that are input/output pairs (I/O pairs). Given a sequence of them and a new input, we want the machine to get a probability distribution on its possible outputs. We allow I and O to be strings or numbers or mixtures of strings and numbers. Very many practical problems fall into this class — e.g. classification problems, identification problems, symbolic and numerical regression, grammar discovery, etc. To solve the problems, we create trial functions that map inputs into possible outputs. We try to find several successful functions of this kind for each of the problems. It is usually desirable, though not at all necessary, to find a common function for all of the problems.

The trial functions are generated by a universal function language such as Lisp or Forth. We want a language that can be defined by a context-free grammar, which we use to generate trial functions. The functions in such languages are represented by trees, which display the choices made in generating the trial functions. Each node in the tree represents a choice of a terminal or non-terminal symbol. Initially, if there are k possible choices at a node, each choice is given probability 1/k. These probabilities, which we call "The Guiding Probability Distribution" (GPD), will evolve considerably as the training sequence progresses.
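Below is a minimal sketch of generating trial functions from a context-free grammar with per-choice probabilities, initially 1/k at each node, that the update step would later re-weight. The tiny arithmetic grammar and the depth cap are my own illustrative choices; the paper itself suggests a language such as Lisp or Forth.

```python
import random

# Hypothetical toy grammar for arithmetic expressions in one variable x.
GRAMMAR = {
    "EXPR": [["TERM"], ["(", "EXPR", "OP", "EXPR", ")"]],
    "TERM": [["x"], ["1"], ["2"]],
    "OP":   [["+"], ["*"]],
}

# The Guiding Probability Distribution: one probability per production choice,
# initially uniform (1/k over the k alternatives of each non-terminal).
GPD = {nt: [1.0 / len(alts)] * len(alts) for nt, alts in GRAMMAR.items()}

def sample(symbol, rng, depth=0):
    """Expand a symbol into a token list, recording the generation probability."""
    if symbol not in GRAMMAR:                      # terminal symbol
        return [symbol], 1.0
    probs = GPD[symbol]
    if depth > 6:                                  # crude depth cap to keep trials finite
        probs = [p if len(a) == 1 else 0.0 for p, a in zip(probs, GRAMMAR[symbol])]
        s = sum(probs)
        probs = [p / s for p in probs]
    i = rng.choices(range(len(probs)), weights=probs)[0]
    tokens, prob = [], probs[i]
    for sym in GRAMMAR[symbol][i]:
        t, p = sample(sym, rng, depth + 1)
        tokens += t
        prob *= p
    return tokens, prob

rng = random.Random(1)
expr, prob = sample("EXPR", rng)
print("".join(expr), prob)   # a candidate function of x and its GPD probability
```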
In a simplified version of the technique, we use Levin's Search Procedure (Lsearch; see Appendix A) to find the single most likely solution to each of n problems. For each problem I_i/O_i we have a function F_i, represented in, say, Reverse Polish Notation (RPN), such that F_i(I_i) = O_i. Using a suitable prediction or compression algorithm (such as Prediction by Partial Matching — PPM) we compress the set of function descriptions, [F_i]. This compressed version of the set of solutions can be predictive and enables us to get a probability distribution for the next sequence of symbols — giving a probability distribution over the next function, F_{n+1}. Levin Search gives us F_{n+1} candidates of highest probability first, so when we are given the next I_{n+1}/O_{n+1} pair, we can select the F_{n+1} of largest probability such that F_{n+1}(I_{n+1}) = O_{n+1}. We then update the system by compressing the code for F_{n+1} into the previous sequence of solution functions, and use this compressed code to find a solution to the next I/O pair in the training sequence. This continues until we have found solution functions for all of the problems in the sequence.

In a more realistic version of the system, using the diversity of ALP, we try to find several functions for each I/O pair in a corpus of, say, 10 pairs. Suppose we have obtained 2 functions for each problem in the set. This amounts to 2^10 = 1024 different "codes" for the entire corpus. We then compress each of these codes and use the shortest, say, 30 of them for prediction on the new input, I_11. The probability distribution on the output O_11 will be the weighted mean of the predictions of the 30 codes, the weights being proportional to 2^(−code length). We also use these 30 codes to assign probabilities to grammar elements in constructing trial functions in future problems.
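Combining the surviving codes is just a weighted mean with weights proportional to 2^(−code length). A sketch with invented code lengths and per-code output distributions:

```python
# Combine the (say) 30 shortest codes for the solved corpus: each code's weight
# is proportional to 2^(-code length in bits), and the output distribution for
# a new input is their weighted mean.  Numbers below are invented.
codes = [
    {"length_bits": 40, "prediction": {"yes": 0.8, "no": 0.2}},
    {"length_bits": 42, "prediction": {"yes": 0.6, "no": 0.4}},
    {"length_bits": 45, "prediction": {"yes": 0.1, "no": 0.9}},
]

weights = [2.0 ** (-c["length_bits"]) for c in codes]
z = sum(weights)

outputs = {"yes": 0.0, "no": 0.0}
for w, c in zip(weights, codes):
    for o, p in c["prediction"].items():
        outputs[o] += (w / z) * p

print(outputs)   # weighted-mean probability distribution over the outputs
```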
We mentioned earlier the use of heuristic programming in designing this system. In both training sequence design and in the design and updating of the GPD, the techniques of heuristic programming are of much import. Consider the problem of learning simple recursive functions. We are given a sequence of n, F(n) pairs containing some consecutive values of n. We want to discover the function F(). A heuristic programmer would try to discover how he himself would solve such a problem — then write a program to simulate himself. For machine learning, we want to find a way for the machine to discover the trick used by the heuristic programmer — or, failing that, the machine should discover when to use the technique, or be able to break it down into subfunctions that are useful for other problems as well. Our continuing problem is to create training sequences and GPDs that enable the machine to do these things.

The Guiding Probability Distribution

The Guiding Probability Distribution (GPD) does two important things: first, it discovers frequently used functions and assigns high probabilities to them. Second, it discovers, for each function, contexts specifying the conditions under which that function should be applied. Both of these operations are quantified by ALP.

In the previous section we described the GPD — how it made predictions of symbols in the next trial function, and how the predictions were based on regularities in the sequences of symbols that represent the solutions of earlier problems. Here we will examine the details of just how the GPD works.

Perhaps the simplest sequential prediction scheme is Laplace's rule: the probability of the next symbol being A, say, is proportional to (the number of times A has occurred in the past, plus one). There is no dependency at all on context. This method of guiding probability evaluation was used in OOPS (Sch02), a system similar in several ways to the presently described system.

Prediction by Partial Matching (PPM) is a very fast, relatively simple probability evaluation scheme that uses context very effectively. It looks at the string of symbols preceding the symbol to be predicted and makes a probability estimate on this basis. PPM and variations of it have been among the best compressors in Hutter's enwik8 challenge — a competition to discover the best compressor for 10^8 bytes of Wikipedia.
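Below is a heavily simplified, PPM-flavoured context predictor: it backs off from the longest previously seen context to shorter ones, with Laplace's rule at order 0. It omits PPM's escape mechanism and exclusions, so it is only a sketch of the back-off idea, not the real algorithm.

```python
from collections import defaultdict

class BackoffPredictor:
    """Counts symbol occurrences after every context up to max_order and
    predicts from the longest context seen so far; order 0 is Laplace's rule."""
    def __init__(self, alphabet, max_order=3):
        self.alphabet = list(alphabet)
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
        self.history = []

    def predict(self):
        for order in range(self.max_order, 0, -1):
            ctx = tuple(self.history[-order:])
            if len(ctx) == order and sum(self.counts[ctx].values()) > 0:
                c = self.counts[ctx]
                total = sum(c.values()) + len(self.alphabet)      # add-one smoothing
                return {s: (c[s] + 1) / total for s in self.alphabet}
        c = self.counts[()]                                       # order 0: Laplace's rule
        total = sum(c.values()) + len(self.alphabet)
        return {s: (c[s] + 1) / total for s in self.alphabet}

    def update(self, symbol):
        for order in range(0, self.max_order + 1):
            if order <= len(self.history):
                ctx = tuple(self.history[-order:]) if order else ()
                self.counts[ctx][symbol] += 1
        self.history.append(symbol)

p = BackoffPredictor("abc")
for s in "abcabcabca":
    p.update(s)
print(p.predict())    # after '...bca', a high probability for 'b'
```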
Most of the improvements in PPM involve "context weighting" — they use several independent prediction schemes, each based on context in a different way. These systems are then merged by giving (localized) weights to each of them. Merging of this sort can be used on the GPDs of several learning systems trained with different training sequences. A weakness of this simple merging is that the system does not create new functions by composing functions discovered by the different prediction schemes. A relevant quote from von Neumann: "For difficult problems, it is good to have 10 experts in the same room — but it is far better to have 10 experts in the same head."

For really good induction, the GPD needs to recognize useful subfunctions and contexts to control the application of these subfunctions. PPM does this to some extent, but it needs much modification. While it is, in theory, easy to specify these modifications, it seems to be difficult to implement them with any speed. Much work needs to be done in this area.

Suppose Lsearch is generating candidate functions for I_j, the current input problem. If x is the part of the candidate that has been generated thus far, then in general the probability distribution on the symbol to follow x will be some function of I_j and x, i.e. G(I_j, x). The form of G will slowly vary as we advance along the training sequence. For the best possible predictions, the form of G should be able to be any conceivable partial recursive function. Few prediction systems allow such generality.

How does the high compression obtained by PPM affect prediction systems? Compression ratio is the ratio of uncompressed string length to compressed string length. Compression ratio translates directly into increasing (geometric) mean probability of symbols and subsequences of symbols. A compression ratio of two raises probabilities to their square root. This translates directly into decreasing search times for solutions to problems. Using Lsearch, the time to find a particular solution will be about t_i/p_i, t_i being the time needed to generate and test the solution, and p_i being the probability assigned to the solution. If t_i = 10^-6 seconds and p_i = 10^-16 for a particular uncompressed problem, then it will take about 10^10 seconds — about 330 years — to find a solution. A compression factor of two will increase p_i to the square root of 10^-16, i.e. 10^-8. So t_i/p_i = 10^2 seconds — about 1.7 minutes — a speedup of 10^8.

What compression ratios can we expect from PPM? A recent version got a compression ratio of 6.33 for a 71 kbyte LISP file. Unfortunately, much of this ratio was obtained by compressing long words used for function names. This kind of compression does not help find functional solutions. From the compression ratios of other, less efficient compressors, my guess is that the elimination of this "long word" regularity would still give a compression ratio of greater than two — perhaps as much as three — enabling the rapid solution of problems that without compression would take many years of search time.

It is notable that high compression ratios were obtained for long text files. For us, the moral is that we will not get full advantage of compression until our training sequences are long enough.

We have discussed PPM at some length as being a good initial GPD. A few other prediction methods that we have examined:

Genetic programming: Very effective. It can discover recursive prediction functions, but it is very slow. We have, however, found many ways in which it could be sped up considerably (Sol06).

Echo State Machines (ESM) (JH04): A very deep neural net — very fast in implementation. It doesn't do recursive functions, but we have found a way to move in that direction.

Support Vector Regression (SVR) (SS09): This is the application of SVMs to regression. The predictions are very good, but the algorithms are very slow and do not support recursion.

In any of the compressors, speed and compression ratio are both important. In selecting a compressor it is necessary to consider this trade-off.

Training Sequences

It is clear that the sequence of problems presented to the system will be an important factor in determining whether the mature system will be very much more capable than the infant system. The task of the training sequence is to teach functions of increasing difficulty by providing problems solved by these functions in a learnable progression. The main criterion of excellence in a training sequence: it enables the system to solve many problems outside the training sequence itself (out of sample data). To do this, the problems in the sequence should be solvable using a relatively small number of powerful functions. Designing training sequences of this sort is a crucial and challenging problem in the development of strong intelligence.

The system must also be able to recognize the context in which each function should be applied. This, however, is a task for the guiding probability distribution.

In most ways, designing a training sequence for an intelligent machine is very similar to designing one for a human student. In the early part of the sequence, however, there is a marked difference between the two. In the early training sequence for a machine, we know exactly how the machine will react to any input problem. We can calculate a precise upper bound on how long it will take to solve early problems. It is just

    T_i / P_i     (1)

P_i is the probability that the machine assigns to the solution known by the trainer. T_i is the time needed to test that solution. I call this upper bound the "Conceptual Jump Size" (CJS). It tells us how hard a problem is for a particular machine — a measure of how long we expect that machine will take to solve it. I say "upper bound" because the system may discover a better, faster solution than that known by the trainer.

This CJS estimate makes it easy to determine if a problem is feasible for a system at a particular point in its education. The P_i for a particular problem will vary during the life of the system, and for a properly constructed training sequence it should increase as the system matures. This increase in P_i can be used as a rough measure of the machine's "rate of learning".
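The CJS bound T_i/P_i is trivial to compute once the trainer knows a reference solution; a sketch with invented per-problem numbers:

```python
# Conceptual Jump Size for each training problem: CJS = T_i / P_i, where P_i is
# the probability the current machine assigns to the trainer's known solution
# and T_i is the time (seconds) needed to test that solution.  Numbers invented.
problems = [
    {"name": "evaluate x+2",  "T": 1e-6, "P": 1e-4},
    {"name": "solve 2x+1=7",  "T": 1e-6, "P": 1e-7},
    {"name": "differentiate", "T": 1e-5, "P": 1e-13},
]

budget_seconds = 3600.0   # what we are willing to spend per problem
for prob in problems:
    cjs = prob["T"] / prob["P"]
    feasible = cjs <= budget_seconds
    print(f'{prob["name"]:15s} CJS = {cjs:9.2e} s  feasible now: {feasible}')
```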
Eventually, in any training sequence for a very intelligent machine, the trainer will not be able to understand the system in enough detail to compute CJS values. The trainer then treats the machine as a human student. By noting which problems are easy and which are difficult for the machine, the trainer infers which relevant functions the machine has learned and which it has not learned, and devises appropriate training problems.

Learning to train very intelligent machines should give useful insights on how to train human students as well.

There are at least two ways to write training sequences: "Bottom Up" and "Top Down". The Bottom Up approach starts with some simple problems that have easy solutions in terms of the primitives of the reference language. The next problems in the sequence have solution functions that are simple combinations of functions the machine has already learned. This increase in complexity of problem solutions continues to the end of the training sequence.

In Top Down training sequence design, we start with a difficult problem that we know how to solve. We express its solution as a function mapping input to output. This function is then decomposed into simpler functions, and we design problems that are solvable by such functions. These functions are in turn factored into simpler functions, and again we devise problems that are solvable by such functions. This breakup of functions and designing of problems continues until we reach the primitive functions of the reference language. The desired training sequence is the set of problems designed, but we present them in an order reversed from that in which they were invented. The functions themselves form a partially ordered set: function F1 is greater than function F2 if F2 is used to create F1. For more details on how to construct training sequences of this kind see (Sol89).

So far I've mainly worked on elementary algebra, starting with learning to evaluate algebraic expressions and solving simple equations — this can be continued with more complex equations, symbolic differentiation, symbolic integration, etc. This list can go on to problems of arbitrary difficulty. A promising source of training material: learning the definitions of the various operations in Maple and/or Mathematica. Another possible source of training sequence ideas is "A Synopsis of Elementary Results in Pure and Applied Mathematics", a book by G. S. Carr. It was the principal source of information about mathematics for Ramanujan — one of the greatest creative geniuses of recent mathematics. His style was one of imaginative inductive leaps and numerical checking — much in the manner of how we would like the present system to operate. After the machine has an understanding of algebra, we can train it to understand English sentences about algebra. This would not include "word problems", which typically require knowledge about the outside world.

It cannot be emphasized too strongly that the goal of early training sequence design is not to solve hard problems, but to get problem solving information into the machine. Since Lsearch is easily adapted to parallel search, there is a tendency to try to solve fairly difficult problems on inadequately trained machines. The success of such efforts is more a tribute to progress in hardware design than to our understanding and exploitation of machine learning.

In Conclusion

We have a method for designing training sequences. We have a method for updating the guiding probability distribution. We have a method for detecting/measuring "learning" in the system. These three techniques are adequate for designing a true AGI.

If the system does not learn adequately, the fault is either in the training sequence (the conceptual jumps needed are too large), or the update algorithm may not be able to recognize important kinds of functions that occur in the training sequence and know under what conditions they should be used — in which case we must modify the training sequence and/or the update algorithm. The update algorithm must be designed so that it can readily spot functions used in the past that are relevant to current problems.

What we have is a recipe for training an intelligent system, and a few ways to debug our attempts at training it. The system itself is built on ALP, which is certainly adequate to the task. Our understanding of much of our own human learning will probably be inadequate at first. There will be conceptual leaps in the training sequences which we don't know how to break down into smaller jumps. In such cases it may be necessary to practice a bit of "brain surgery" to teach the machine — direct programming of functions into the reference language. Usually we will try to avoid such drastic measures by simplification of problems or by giving auxiliary related problems — by giving "hints".

We have mentioned the possibility of merging the guiding probability distributions of different systems created by independent investigators. It would be well if there were several research groups working on systems of the type described, with enough similarity in the reference languages and update algorithms so that the guiding probability distributions could, indeed, be merged.

The system we have described will do fairly general kinds of prediction. It can be regarded as Phase 1 of a larger project. Phase 2 (Sol03a) is built on Phase 1 and is designed to solve even more general kinds of problems. In particular, it is able to work time-limited optimization problems — for example, "Get as good a solution as you can to this traveling salesman problem in 10 minutes". Most practical problems in science and engineering are of this kind. This includes the problem of improving the GPD of Phase 1 — enabling the system to significantly improve itself.

Appendix A: Levin's Search Procedure

It would seem that evaluating a very large number of functions would take an enormous amount of time. However, by using a search technique similar to one used by Levin for a somewhat different kind of problem, it is possible to perform the search for acceptable functions in something approaching optimum speed. It may occasionally take a long time to find a very good solution — but it is likely that no other search technique with equivalent education and hardware could have found that solution any faster.

How the procedure works: Suppose we have an input string I_1 and an output string O_1. We also have a probability distribution p_j over all possible functions F_j, and we want to find high probability functions F_j such that F_j(I_1) = O_1. We could simply apply many random functions to I_1 and watch for functions that meet our requirements. This would take a lot of time. There is, however, a much more efficient way: we select a small time limit T, and we test all functions F_j such that

    t_j < T p_j     (2)

Here p_j is the probability of the function being tested, and t_j is the time required to test it. The test itself is to see if F_j(I_1) = O_1. If we find no function of this sort, we double T and go through the same test procedure. We repeat this routine until we find satisfactory functions. If F_j is one of the successful functions, then the entire search for it will take time ≤ 2 t_j / p_j.

There is a faster, time-shared version of Lsearch that takes only time ≤ t_j / p_j, but it takes much, much more memory. An important feature of Lsearch is that it is able to deal with trials that do not converge — i.e. trials for which t_j = ∞.
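A compact sketch of the doubling schedule described above. Since t_j is not known in advance, each candidate is simply run for a step budget proportional to T·p_j, which amounts to the same rule; the candidate functions, their probabilities, and the one-step test are invented for illustration.

```python
def lsearch(candidates, test, initial_T=1e-3, max_T=1e12):
    """candidates: list of (function, probability p_j).
    test(f, budget_steps) -> (succeeded, steps_used or None).
    Runs each candidate for a number of steps proportional to T * p_j,
    doubling T until some candidate passes the test."""
    T = initial_T
    while T <= max_T:
        for f, p in candidates:
            budget = int(T * p)          # stand-in for "test only if t_j < T * p_j"
            if budget < 1:
                continue
            ok, _steps = test(f, budget)
            if ok:
                return f, T
        T *= 2.0
    return None, T

# Toy use: find a function mapping input 3 to output 9 (assumed problem I_1/O_1 = 3/9).
def make_test(x_in, y_out):
    def test(f, budget_steps):
        # here every candidate "costs" one step; real trials may not halt at all,
        # which is why each one only gets a bounded budget per phase
        if budget_steps < 1:
            return False, None
        return f(x_in) == y_out, 1
    return test

candidates = [(lambda x: x + 1, 0.5), (lambda x: 2 * x, 0.25), (lambda x: x * x, 0.25)]
f, T = lsearch(candidates, make_test(3, 9))
print(f(5), T)   # the discovered function squares its input: prints 25
```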
Appendix B: Frustrating Computable Probability Distributions

Given a computable probability function µ, I will show how to generate a deterministic sequence (i.e. probabilities are only 0 and 1) Z = Z_1 Z_2 Z_3 ··· to which µ gives probabilities that are extremely bad: they are always in error by ≥ .5.

Let µ(Z_{n+1} = 1 | Z_1 ··· Z_n) be µ's estimate that Z_{n+1} will be 1, in view of Z_1 ··· Z_n. Then, with Λ denoting the empty string:

    if µ(Z_1 = 1 | Λ) < .5 then Z_1 = 1 else Z_1 = 0
    if µ(Z_2 = 1 | Z_1) < .5 then Z_2 = 1 else Z_2 = 0
    if µ(Z_k = 1 | Z_1, Z_2, ··· Z_{k−1}) < .5 then Z_k = 1 else Z_k = 0.

In a similar way, we can construct probabilistic sequences in which µ is in error by ≥ ε, where ε can have any value between 0 and .5.

References

P. Gács. Theorem 5.2.1. In M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, pages 328–331. Springer-Verlag, N.Y., second edition, 1997.

M. Hutter. Optimality of universal Bayesian sequence prediction for general loss and alphabet. Technical report, IDSIA, Lugano, Switzerland, 2002. http://www.idsia.ch/~marcus/ai/

H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, April 2004.

J. Rissanen. Modeling by the shortest data description. Automatica, 14:465–471, 1978.

J. Schmidhuber. Optimal ordered problem solver. TR IDSIA-12-02, IDSIA, Lugano, Switzerland, July 2002. http://www.idsia.ch/~juergen/oops.html

R. J. Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, IT-24(4):422–432, 1978.

R. J. Solomonoff. A system for incremental learning based on algorithmic probability. In Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, pages 515–527, Tel Aviv, Israel, December 1989.

R. J. Solomonoff. Progress in incremental machine learning. TR IDSIA-16-03, IDSIA, 2003. Given at NIPS Conference, Dec. 14, 2002, Whistler, B.C., Canada.

R. J. Solomonoff. Machine learning - past and future. Lecture given in 2006 at AI@50, The Dartmouth A.I. Conference: The Next Fifty Years, Dartmouth, N.H., July 2006.

N. Sapankevych and R. Sankar. Time series prediction using support vector machines: A survey. IEEE Computational Intelligence Magazine, 4(2):24–38, May 2009.

C. S. Wallace and D. M. Boulton. An information measure for classification. The Computer Journal, 11:185–194, 1968.

All of Solomonoff's papers and reports listed here are available at http://world.std.com/~rjs/pubs.html