Philosophical Applications of Computational Learning Theory

Chomskyan Innateness and Occam's Razor

Graduation thesis, Philosophy of Computer Science

R.M. de Wolf

Erasmus Universiteit Rotterdam
Department of Theoretical Philosophy
Supervisor: Dr. G.J.C. Lokhorst

July

Contents

Preface

Introduction to Computational Learning Theory
Introduction
Algorithms that Learn Concepts from Examples
The PAC Model and Efficiency
PAC Algorithms
Sample Complexity
Time Complexity
Representations
Polynomial-Time PAC Learnability
Some Related Settings
Polynomial-Time PAC Predictability
Membership Queries
Identification from Equivalence Queries
Learning with Noise
Summary

Application to Language Learning
Introduction
A Brief History of the Chomskyan Revolutions
Against Behaviorism
Transformational-Generative Grammar
Principles-and-Parameters
The Innateness of Universal Grammar
Formalizing Languages and Grammars
Formal Languages
Formal Grammars
The Chomsky Hierarchy
Simplifying Assumptions for the Formal Analysis
Language Learning Is Algorithmic Grammar Learning
All Children Have the Same Learning Algorithm
No Noise in the Input
PAC Learnability Is the Right Analysis
Formal Analysis of Language Learnability
The PAC Setting for Language Learning
A Conjecture
The Learnability of Type Languages
The Learnability of Type Languages
The Learnability of Type Languages
Of What Type Are Natural Languages?
A Negative Result
k-Bounded Context-Free Languages Are Learnable
Simple Deterministic Languages Are Learnable
The Learnability of Type Languages
Finite Classes Are Learnable
What Does All This Mean?
Wider Learning Issues
Summary

Kolmogorov Complexity and Simplicity
Introduction
Definition and Properties
Turing Machines and Computability
Definition
Objectivity up to a Constant
Non-Computability
Relation to Shannon's Information Theory
Simplicity
Randomness
Finite Random Strings
Infinite Random Strings
An Interesting Random String
Gödel's Theorem Without Self-Reference
The Standard Proof
A Proof Without Self-Reference
Randomness in Mathematics
Summary

Occam's Razor
Introduction
Occam's Razor
History
Applications in Science
Applications in Philosophy
Occam and PAC Learning
Occam and Minimum Description Length
Occam and Universal Prediction
On the Very Possibility of Science
Summary
A Proof of Occam's Razor (PAC Version)

Summary and Conclusion
Computational Learning Theory
Language Learning
Kolmogorov Complexity
Occam's Razor
Conclusion

List of Symbols
Bibliography
Index of Names
Index of Subjects

Preface

What is learning? Learning is what makes us adapt to changes and threats, and what allows us to cope with a world in flux. In short, learning is what keeps us alive. Learning has strong links to almost any other topic in philosophy: scientific inference, knowledge, truth, reasoning, logic, language, anthropology, behaviour, ethics, good taste (aesthetics), and so on. Accordingly, it can be seen as one of the quintessential philosophical topics: an appropriate topic for a graduation thesis. Much can be said about learning, too much to fit in a single thesis. Therefore this thesis is restricted in scope, dealing only with computational learning theory (often abbreviated to COLT).

Learning seems so simple: we do it every day, often without noticing it. Nevertheless, it is obvious that some fairly complex mechanisms must be at work when we learn. COLT is the branch of Artificial Intelligence that deals with the computational properties and limitations of such mechanisms. The field can be seen as the intersection of Machine Learning and complexity theory. COLT is a very young field (the publication of Valiant's seminal paper in 1984 may be seen as its birth), and it appears that virtually none of its many interesting results are known to philosophers. Some philosophical work has referred to complexity theory (for instance [Che]), and some has referred to Machine Learning (for instance [Tha]), but as far as I know, thus far no philosophical use has been made of the results that have sprung from COLT. For instance, at the time of writing of this thesis, the Philosopher's Index (a database containing most major publications in philosophical journals or books) contains no entries whatsoever that refer to COLT or to its main model, the model of PAC learning; there is hardly any reference to Kolmogorov complexity either. Stuart Russell devotes two pages to PAC learning [Rus], and James McAllister devotes one page to Kolmogorov complexity as a quantitative measure of simplicity [McA], but both do not provide more than a sketchy and superficial explanation.

As its title already indicates, the present thesis tries to make contact between COLT and philosophy. The aim of the thesis is threefold. The first and most shallow goal is to obtain a degree in philosophy for its author. The second goal is to take a number of recent results from computational learning theory, insert them in their appropriate philosophical context, and see how they bear on various ongoing philosophical discussions. The third and most ambitious goal is to draw the attention of philosophers to computational learning theory in general. Unfortunately, the traditional reluctance of philosophers (in particular those of a non-analytical strand) to use formal methods, as well as their inaptness with formal methods once they have outgrown that reluctance, makes this goal rather hard to attain. Nevertheless, it is my opinion that computational learning theory, or for that matter complexity theory as a whole, has much to offer to philosophy. Accordingly, this thesis may be seen as a plea for the philosophical relevance of computational learning theory as a whole.

The thesis is organized as follows. In the first chapter we will give an overview of the main learning setting used in COLT. We will stick to the rigorous formal definitions as used in COLT, but supplement them with a lot of informal and intuitive comment, in order to make them accessible and readable for philosophers. After that introductory chapter, the second chapter applies results from COLT to Noam Chomsky's ideas about language learning and the innateness of linguistic biases. The third chapter gives an introduction to the theory of Kolmogorov complexity, which provides us with a fundamental measure of simplicity. Kolmogorov complexity does not belong to computational learning theory proper (its invention in fact predates COLT), but its main application for us lies in Occam's Razor. This fundamental maxim tells us to go for the simplest theories consistent with the data, and is highly relevant in the context of learning in general and scientific theory construction in particular. The fourth chapter provides different formal settings in which some form of the razor can be mathematically justified, using Kolmogorov complexity to quantitatively measure simplicity. Finally, we end with a brief chapter that summarizes the thesis in non-technical terms.

Let me end this preface by expressing my gratitude to a number of people who contributed considerably to the contents of this thesis. First of all, I would of course like to thank my thesis advisor Gert-Jan Lokhorst (one of those philosophers who, like myself, do not hesitate to invoke formal definitions and results whenever these might be useful) for his many comments and suggestions. Secondly, many thanks should go to Shan-Hwei Nienhuys-Cheng, who put me on the track of inductive learning in the first place by inviting me to join her in writing a book on inductive logic programming [NW], a book which eventually took us more than two and a half years to finish. One chapter of that book actually provided the basis for a large part of the first chapter of the present thesis. Finally, I would like to thank Jeroen van Rijen for his many helpful comments, Peter Sas and Peter Grünwald for some references on linguistics, and Paul Vitányi for his course on learning and Kolmogorov complexity.

Chapter 1

Introduction to Computational Learning Theory

Introduction

This thesis is about philosophical applications of computational learning theory, and the present chapter provides an introduction to this field.

Why should we, as philosophers, be interested in something like a theory of learning? The importance of learning can be illustrated on the basis of the following quotation from Homer's Iliad:

  Him she found sweating with toil as he moved to and fro about his bellows in eager haste; for he was fashioning tripods, twenty in all, to stand around the wall of his well-builded hall, and golden wheels had he set beneath the base of each that of themselves they might enter the gathering of the gods at his wish and again return to his house, a wonder to behold.
  (Iliad XVIII, from [Hom], second volume)

This quotation might well be the first ever reference to something like Artificial Intelligence: man-made (or in this case god-made) artifacts displaying intelligent behaviour. As Thetis, Achilles' mother, enters Hephaestus' house in order to fetch her son a new armour, she finds Hephaestus constructing something we today would call robots. His twenty tripods are of themselves to serve the gathering of the gods, bring them food, etc., whenever Hephaestus so desires.

Let us consider for a moment the kind of behaviour such a tripod should display. Obviously, it should be able to recognise Hephaestus' voice and to extract his wishes from his words. But furthermore, when serving the gods, the tripod should know and act upon many requirements, such as the following:

- If there is roasted owl for dinner, don't give any to Pallas Athena.
- Don't come too close to Hera if Zeus has committed adultery again.
- Stop fetching wine for Dionysus when he is too drunk.

It is clear that this list can be continued without end. Again and again one can think of new situations that the tripod should be able to adapt to properly. It seems impossible to take all these requirements into account explicitly in the construction of the intelligent tripod. The task of coding each of the infinite number of requirements into the tripods may be considered too much, even for Hephaestus, certainly one of the most industrious among the Greek gods.

One solution to this problem would be to initially endow the tripod with a modest amount of general knowledge about what it should do, and to give it the ability to learn from the way the environment reacts to its behaviour. That is, if the tripod does something wrong, it can adjust its knowledge and its behaviour accordingly, thus avoiding making the same mistake in the future. In that way, the tripod need not know everything beforehand. Instead, it can build up most of the required knowledge along the way. Thus the tripod's ability to learn would save Hephaestus a lot of trouble.

The importance of learning is not restricted to artifacts built to serve divine wishes. Human beings, from birth till death, engage in an ongoing learning process. After all, children are not born with their native language, polite manners, the abilities to read and write, earn their living, make jokes, or to do philosophy or science (or even philosophy of science). These things have to be learned. In fact, if we take 'learning' in a sufficiently broad sense, any kind of adaptive behaviour will fall under the term. Since learning is of crucial importance to us, a theory of learning is of crucial importance to philosophy.

Algorithms that Learn Concepts from Examples

After the previous section, the importance of a theory of learning should be clear. But why computational learning theory? Well, a theory of learning should be about the way we human beings learn, or, more generally, about the way any kind of learning can be achieved. A description of a way of learning will usually boil down to something like a recipe: learning amounts to doing such-and-such things, taking such-and-such steps. Now it seems that the only precise way we have to specify these 'such and such' steps (and of course we should aim for precision) is by means of an algorithm. In general, an algorithm or a mechanical procedure precisely prescribes the sequence of steps that have to be taken in order to solve some problem. Hence it is obvious that learning theory can, or perhaps even should, involve the study of algorithms that learn. Since algorithms perform computation, algorithmic theory of learning is usually called computational learning theory.

(Of course, for this scheme to work we have to assume that the tripod survives its initial failures. If Zeus immediately smashes the tripod into pieces for bringing him white instead of red wine, the tripod won't be able to learn from its experience.)

(The notion of an algorithm includes neural networks, at least those that can be simulated on an ordinary computer. In fact, the learnability of neural networks has been one of the most prominent research areas in computational learning theory.)

In this thesis we will mainly be concerned with learning a concept from examples. If we take the term 'example' in a sufficiently broad sense, almost any kind of learning will be based on examples, so restricting attention to learning from examples is not really a restriction. On the other hand, restricting attention to learning a concept appears to be quite restrictive, since it excludes learning know-how knowledge, for instance learning how to run a marathon. However, many cases of know-how learning can actually quite well be modeled or redescribed as cases of concept learning. For instance, learning a language may at first sight appear to be a case of learning know-how (i.e., knowing how to use words), but it can also be modeled as the learning of a grammar for a language. Accordingly, concept learning can be used to model any kind of learning in which the learned thing can feasibly be represented by some mathematical construct: a grammar, a logical formula, a neural network, and what not. This includes a very wide range of topics, from language learning to large parts of empirical science.

(This is what the Chomskyan revolution in linguistics is all about; see the next chapter.)

Induction, which is the usual name for learning from examples, has been a topic of inquiry for centuries. The study of induction can be approached from many angles. Like most other scientific disciplines, it started out as a part of philosophy. Philosophers particularly focused on the role induction plays in the empirical sciences. For instance, Aristotle characterized science roughly as deduction from first principles, which were to be obtained by means of induction from experience [Ari]. (Though it should be noted that Aristotle's notion of induction was rather different from the modern one, involving the 'seeing' of the essential forms of examples.)

After the Middle Ages, Francis Bacon [Bac] revived the importance of induction from experience in the modern sense as the main scientific activity. In later centuries, induction was taken up by many philosophers. David Hume [Hum] formulated what is nowadays called the problem of induction, or Hume's problem: how can induction from a finite number of cases result in knowledge about the infinity of cases to which an induced general rule applies? What justifies inferring a general rule or law of nature from a finite number of cases? Surprisingly, Hume's answer was that there is no such justification. In his view, it is simply a psychological fact about human beings that when we observe some particular pattern recur in different cases, without observing counterexamples to the pattern, we tend to expect this pattern to appear in all similar cases. In Hume's view, this inductive expectation is a habit, analogous to the habit of a dog who runs to the door after hearing his master call, expecting to be let out. Later philosophers, such as John Stuart Mill [Mil], tried to answer Hume's problem by stating conditions under which an inductive inference is justified. Other philosophers who made important comments on induction were Stanley Jevons [Jev] and Charles Sanders Peirce [Pei].

In our century, induction was mainly discussed by philosophers and mathematicians who were also involved in the development and application of formal logic. Their treatment of induction was often in terms of the probability or the 'degree of confirmation' that a particular theory or hypothesis receives from available empirical data. Some of the main contributors are Bertrand Russell [Rus], Rudolf Carnap [Car], Carl Hempel [Hem], Hans Reichenbach [Rei], and Nelson Goodman [Goo]. Particularly in Goodman's work, an increasing number of unexpected conceptual problems appeared for induction. Eventually, induction was sworn off by philosophers of science such as Karl Popper [Pop].

However, in roughly the same period it was recognised in the rapidly expanding field of Artificial Intelligence that the knowledge an AI system needs to perform its tasks should not all be hand-coded into the system beforehand. Instead, it is much more efficient to provide the system with a relatively small amount of knowledge and with the ability to adapt itself to the situations it encounters, that is, to learn from its experience. Thus the study of induction switched from philosophy to Artificial Intelligence. The branch of AI that studies learning is called Machine Learning. As Marvin Minsky, one of the founders of AI, wrote: "Artificial Intelligence is the science of making machines do things that would require intelligence if done by man" ([Min], p. v). Given this view, the study of induction is indeed part of AI, since learning from examples certainly requires intelligence if done by man.

The PAC Model and Efficiency

Since Machine Learning is concerned with formal learning algorithms, it needs formal models of what it means to learn something: what kinds of examples and other resources does a learning algorithm have at its disposal, and what are its goals? In general, a learning algorithm reads a number of examples for some unknown target concept, and has to induce or learn some concept on the basis of these examples. Initial analysis of learnability in Machine Learning was mainly done in terms of Gold's paradigm of identification in the limit [Gol], but nowadays Valiant's paradigm of PAC learnability [Val] is usually considered to provide a better model of learnability. A PAC algorithm is an algorithm that reads examples concerning some target concept, which is taken from some class F of concepts. The algorithm knows from which class the target concept is chosen, but it does not know which particular concept is the target, and its only access to the target is through the examples it reads. These examples will usually not provide complete knowledge of the target concept, so we cannot expect our algorithm to learn the target exactly. Instead, we can only hope to learn an approximately correct concept: a concept which diverges only slightly from the target. Moreover, since the given set of examples may be biased and need not always be a good representative of the target concept as a whole, we cannot even expect to learn approximately correctly every time. Accordingly, the best our algorithm can do is learn a concept which is probably approximately correct (PAC) with respect to the target concept, whenever the target is drawn from F. That is, a PAC algorithm should, with high probability, learn a concept which diverges only slightly from the target concept.

(Interestingly enough, Thomas Kuhn, Popper's antipode in the philosophy of science, later became involved in computer models of inductive concept learning from examples; see [Kuh].)

In the PAC model, a class of concepts is only considered learnable if there is an efficient PAC algorithm for that class. Efficiency concerns two major complexity issues: how many examples do we need to ensure that we will probably find an approximately correct concept (sample complexity), and how many steps do we need to take to find such a concept (time complexity)? An algorithm is efficient if both its sample and its time complexity can be upper-bounded by a polynomial function in the inputs of the algorithm.

(Note carefully that learnability is a property of classes of concepts, rather than of individual concepts. A class consisting of a single concept f is always learnable, because a learning algorithm can already know in advance that the target concept has to be f in this case.)

(A polynomial function with variables x_1, ..., x_n is a sum of terms of the form c·x_1^{e_1}·x_2^{e_2}·...·x_n^{e_n}, where c is an arbitrary real constant and the exponents e_1, ..., e_n are non-negative real constants. For instance, x^2 + 3x and x_1^2·x_2 + x_3 are polynomials.)

Before embarking on formal expositions of PAC algorithms and their sample and time complexity, let us first say something about why we call an algorithm with polynomially-bounded running time an "efficient" algorithm. Suppose we want to solve some family of problems, and we can measure the size of each particular problem (or instance) in that family by some integer n. For instance, we might want to construct an algorithm for solving the traveling salesman problem (TSP): given a road map and a number of cities on the map, find the shortest route that leads past each city on the map. For simplicity, let us define the size of a particular traveling salesman problem as the number n of cities on the map.

(In fact, it is a big open question whether TSP can be solved by a polynomial-time algorithm, because this problem, in a slightly different form, is known to be NP-complete. See a later note for more on NP-completeness.)

Suppose we have two algorithms for solving TSP; each takes a particular TSP-instance as input, and finds the shortest route for us in a finite number of steps. Now suppose the first algorithm, when given a problem of size n as input, gives the right answer after n^2 steps, while the second needs 2^n steps. Let us call the first the polynomial algorithm and the second the exponential algorithm. Consider the number of steps needed by these two algorithms for larger n:

[Table: number of cities n, versus the number of steps taken by the polynomial algorithm (n^2) and by the exponential algorithm (2^n).]

As can be seen from this table, the time required by the exponential algorithm really explodes for larger n, while for the polynomial algorithm it grows much more moderately. Both algorithms solve the same problem correctly, but the polynomial algorithm needs much less time for this than the exponential one.
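The lost entries of such a table are easy to regenerate. The following short sketch (the problem sizes are illustrative choices of my own, and the n^2 versus 2^n step counts are the ones assumed above) prints the comparison:

```python
# Illustrative comparison of a polynomial (n^2) and an exponential (2^n)
# step count for the traveling salesman example discussed above.
# The chosen problem sizes are arbitrary illustrations.

def polynomial_steps(n: int) -> int:
    return n ** 2

def exponential_steps(n: int) -> int:
    return 2 ** n

if __name__ == "__main__":
    for n in (10, 20, 30, 40, 50):
        print(f"n = {n:2d}: polynomial needs {polynomial_steps(n):>6d} steps, "
              f"exponential needs {exponential_steps(n):>16d} steps")
```

Already at n = 50 the exponential algorithm needs more than 10^15 steps, while the polynomial algorithm needs only 2500.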

The relative efficiency of the polynomial algorithm can also be seen in another way. According to Moore's well-known law, the speed of computers doubles every one and a half years. Suppose that in January of some year we can solve TSPs of length up to n_1 in one hour's time using the polynomial algorithm, and TSPs of length up to n_2 with the exponential algorithm. Now suppose computing power doubles, and in July of the next year we have a computer which can make twice as many steps in one hour as the computer we used in January. Then it is easy to show that with this faster computer we can solve TSPs of length up to √2·n_1, roughly 1.41·n_1, with the polynomial algorithm: an improvement of about 41%, which is quite nice. However, with the exponential algorithm we can only solve problems of length up to n_2 + 1 in one hour. Thus an increase in computing power makes a great difference if we use the efficient polynomial algorithm, but makes hardly any difference using the exponential algorithm.
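The arithmetic behind this comparison can be checked with a few lines of code; the step budget below is an arbitrary illustrative number, and the n^2 and 2^n step counts are the ones assumed above.

```python
# How much larger a problem can each algorithm handle when the number of
# steps we can take in an hour doubles?
import math

def max_n_polynomial(budget: float) -> float:
    # largest n with n^2 <= budget
    return math.sqrt(budget)

def max_n_exponential(budget: float) -> float:
    # largest n with 2^n <= budget
    return math.log2(budget)

budget = 10 ** 9   # steps per hour on the old machine (illustrative)
for name, f in [("polynomial", max_n_polynomial), ("exponential", max_n_exponential)]:
    before, after = f(budget), f(2 * budget)
    print(f"{name}: n ~ {before:.1f} before, n ~ {after:.1f} after doubling "
          f"(factor {after / before:.2f})")
```

Running this shows the polynomial algorithm gaining a factor of about 1.41 in solvable problem size, while the exponential algorithm gains only a single unit.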

PAC Algorithms

Let us first illustrate and motivate the definition of a PAC algorithm by means of a metaphorical example. Suppose some biology student wants to learn from examples to distinguish insects from other animals. That is, he wants to learn the concept of an insect within the domain of all animals. A teacher gives the student examples: a positive example is an insect, a negative example is some other animal. The student has to develop his own concept of what an insect is on the basis of these examples. Now the student will be said to have learned the concept approximately correctly if, when afterwards tested, he classifies only a small percentage of given test animals incorrectly as insect or non-insect. In other words, his own developed concept should not diverge too far from the real concept of an insect.

In the interest of fairness, we require that the animals given as examples during the learning phase, and the animals given afterwards as test, are all selected by the same teacher, or at least by teachers with the same inclinations. For suppose the student learns from a teacher with a particular interest in European animals, whose examples are mainly European animals. Then it would be somewhat unfair if the animals that were given afterwards to test the student were selected by a different teacher, having a decisive interest in the very different set of African insects. In other words, the student should be taught and tested by the same teacher.

Let us now formalize this setting. To my knowledge, three different textbooks for computational learning theory exist to date, respectively written by Natarajan [Nat], Anthony and Biggs [AB], and Kearns and Vazirani [KV]. The formal definitions in this chapter will mostly follow Natarajan.

Definition. A domain X is a set of strings over some finite alphabet Σ. The length of some x ∈ X is the string length of x. X^[n] denotes the set of all strings in X of length at most n.
A concept f is a subset of X; a concept class F is a set of concepts. An example for f is a pair (x, y), where x ∈ X and y is called the label of the example: y = 1 if x ∈ f, and y = 0 otherwise. If y = 1, then the example is positive; if y = 0, it is negative.
If f and g are two concepts, then f ∆ g denotes the symmetric difference of f and g: f ∆ g = (f \ g) ∪ (g \ f). ◁

In our metaphor, X would be the set of descriptions of all animals, the target concept f ⊆ X would be the set of descriptions of all insects, and the student would develop his own concept g ⊆ X on the basis of a number of positive and negative examples (i.e., insects and non-insects). The symmetric difference f ∆ g would be the set of all animals which the student classifies incorrectly: all insects that he takes to be non-insects, and all non-insects he takes to be insects.

For technical reasons, we restrict the examples to those of length at most some number n, so all examples are drawn from X^[n]. Note that X^[n] is a finite set. We assume these examples are given according to some unknown probability distribution P on X^[n], which reflects the particular interests of the teacher. If S ⊆ X^[n], we let P(S) denote the probability that a member of X^[n] that is drawn according to P is a member of S, i.e., P(S) = Σ_{s∈S} P(s). Now suppose the student has developed a certain concept g. Then in the test phase he will misclassify some object x ∈ X^[n] iff x ∈ f ∆ g. Thus we can say that g is approximately correct if the probability that such a misclassified object is given during the test phase is small:

  P(f ∆ g) ≤ ε,

where ε is called the error parameter. For instance, if ε = 0.05, then there is a chance of at most 5% that an arbitrarily given test object from X^[n] will be classified incorrectly. Note that the set of examples that is given, as well as the evaluation of approximate correctness of the learned concept g, depends on the same probability distribution P. This formally reflects the fairness requirement that the student is taught and tested by the same teacher.

('Iff' abbreviates 'if and only if'.)

After all these preliminaries, we can now define a PAC algorithm as an algorithm which, under some unknown distribution P and target concept f, learns a concept g which is probably approximately correct. 'Probably' here means with probability at least 1 − δ, where δ is called the confidence parameter. For instance, if δ = 0.1 and the algorithm would be run an infinite number of times, at least 90% of these runs would output an approximately correct concept. The constants ε, δ and n are given by the user as input to the algorithm.

Definition. A learning algorithm L is a PAC algorithm for a concept class F over domain X if:

1. L takes as input real numbers ε, δ > 0 and a natural number n ∈ N, where ε is the error parameter, δ is the confidence parameter, and n is the length parameter.

2. L may call the procedure Example, each call of which returns an example for some unknown target concept f ∈ F, according to an arbitrary and unknown probability distribution P on X^[n].

3. For all concepts f ∈ F and all probability distributions P on X^[n], L outputs a concept g such that, with probability at least 1 − δ, P(f ∆ g) ≤ ε. ◁

A PAC algorithm may be randomized, which means, informally, that it may toss coins and use the results in its computations. One further technicality: a PAC algorithm should be admissible, meaning that for any input ε, δ, n, for any sequence of examples that Example may return, and for any concept g, the probability that L outputs g should be well defined.
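To make the three clauses of this definition concrete, here is a minimal sketch (my own illustration, not an algorithm from the thesis) of a PAC learner for a deliberately simple concept class: the 'threshold' concepts f_t = {x : x ≤ t} over a small integer domain, a simplification of the string domains used above. The sample size m = ⌈(1/ε) ln(1/δ)⌉ used below is the standard bound that makes this particular algorithm probably approximately correct.

```python
import math
import random

# Toy domain: integers 0..N instead of strings (a simplification).
# Concept class F: thresholds f_t = {x : x <= t}.
N = 1000

def make_example_oracle(target_t, distribution):
    """Return an Example() procedure for the target concept f_t under the
    given distribution (a function producing random domain elements)."""
    def example():
        x = distribution()
        return x, 1 if x <= target_t else 0   # (x, label)
    return example

def pac_learn_threshold(example, eps, delta):
    """Output the largest positive example seen. With
    m >= (1/eps) * ln(1/delta) examples, the hypothesis has error at most
    eps with probability at least 1 - delta."""
    m = math.ceil((1 / eps) * math.log(1 / delta))
    best = -1                                 # start with the empty hypothesis
    for _ in range(m):
        x, label = example()
        if label == 1 and x > best:
            best = x
    return best                               # hypothesis g = {x : x <= best}

if __name__ == "__main__":
    target = 600
    dist = lambda: random.randint(0, N)       # the unknown distribution P
    oracle = make_example_oracle(target, dist)
    g = pac_learn_threshold(oracle, eps=0.05, delta=0.05)
    print("learned threshold:", g, "(true threshold:", target, ")")
```

The hypothesis can only err on the region between the learned and the true threshold; if that region has probability mass more than ε, the chance of never sampling it in m draws is at most (1 − ε)^m ≤ δ, which is exactly the 'probably approximately correct' guarantee.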

Sample Complexity

Having a PAC algorithm for a concept class F is nice, but having an efficient PAC algorithm for F is even nicer. In this section we analyze this efficiency in terms of the number of examples the algorithm needs (the sample complexity), while in the next section we treat the number of steps the algorithm needs to take (time complexity).

The sample complexity of a learning algorithm can be seen as a function from its inputs ε, δ and n to the maximum number of examples that the algorithm reads when learning an unknown target concept under an unknown probability distribution. Since the examples are drawn according to a probability distribution, different runs of the same algorithm with the same input (and the same target concept and distribution) may still read different examples. Thus different runs of the same algorithm with the same input may need a different number of examples in order to find a satisfactory concept. Therefore the sample complexity as defined below relates to the maximum number of examples over all runs of the algorithm with the same input.

Definition. Let L be a learning algorithm for concept class F. The sample complexity of L is a function s, with parameters ε, δ and n. It returns the maximum number of calls of Example made by L, over all runs of L with inputs ε, δ, n, for all f ∈ F and all P on X^[n]. If no finite maximum exists, we let s(ε, δ, n) = ∞. ◁

Of course, for the sake of efficiency we want this complexity to be as small as possible. A concept class is usually considered to be efficiently PAC learnable, as far as the required number of examples is concerned, if there is a PAC algorithm for this class for which the sample complexity is bounded from above by a polynomial function in 1/ε, 1/δ and n. Of course, even polynomials may grow rather fast (consider a high-degree polynomial like n^1000), but still their growth rate is much more moderate than, for instance, exponential functions.

Definition. A concept class F is called polynomial-sample PAC learnable if a PAC algorithm exists for F which has a sample complexity bounded from above by a polynomial in 1/ε, 1/δ and n. ◁

Note that polynomial-sample PAC learnability has to do with the worst case: if the worst case cannot be bounded by a polynomial, a concept class is not polynomial-sample PAC learnable, even though there may be PAC algorithms which take only a small polynomial number of examples on average.

A crucial notion in the study of sample complexity is the dimension named after Vapnik and Chervonenkis [VC].

Definition. Let F be a concept class on domain X. We say that F shatters a set S ⊆ X if {f ∩ S | f ∈ F} = 2^S, i.e., if for every subset S' of S there is an f ∈ F such that f ∩ S = S'. ◁

Definition. Let F be a concept class on domain X. The Vapnik-Chervonenkis dimension (VC-dimension) of F, denoted by D_VC(F), is the greatest integer d such that there exists a set S ⊆ X with |S| = d that is shattered by F. D_VC(F) = ∞ if no greatest d exists. ◁

Note that if 2^S ⊆ F, then F shatters S. Thus if F = 2^S for some finite set S, then F has |S| as VC-dimension.

Example. Let X = {1, 2, 3, 4, 5}, and let F = {∅, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}} be a concept class. Then F shatters the set S = {1, 2}, because {f ∩ S | f ∈ F} = {∅, {1}, {2}, {1,2}} = 2^S. Thus F's shattering of S intuitively means that F breaks S into all possible pieces.
F also shatters S' = {1, 2, 3}, because {f ∩ S' | f ∈ F} = {∅, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}} = 2^{S'}. F does not shatter S'' = {1, 2, 3, 4}, since there is, for instance, no f ∈ F with f ∩ S'' = {4}. In general, there is no set of four or more elements that is shattered by F, so we have D_VC(F) = |S'| = 3.
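Shattering and the VC-dimension can be checked by brute force on small finite classes. The following sketch (my own illustration, using the small concept class from the example above) does exactly that; exhaustive search is of course only feasible for tiny domains.

```python
from itertools import chain, combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def shatters(F, S):
    """Does the concept class F shatter the set S?"""
    projections = {frozenset(f & S) for f in F}
    return projections == set(powerset(S))

def vc_dimension(F, X):
    """Brute-force VC-dimension of F over a finite domain X."""
    d = 0
    for S in powerset(X):
        if len(S) > d and shatters(F, set(S)):
            d = len(S)
    return d

# The concept class from the example above.
F = [set(), {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
X = {1, 2, 3, 4, 5}
print(shatters(F, {1, 2}))         # True
print(shatters(F, {1, 2, 3}))      # True
print(shatters(F, {1, 2, 3, 4}))   # False
print(vc_dimension(F, X))          # 3
```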

Since we are actually dealing with X^[n] rather than with X itself, we need the following definitions, which project the VC-dimension on X^[n].

Definition. The projection of a concept f on X^[n] is f^[n] = f ∩ X^[n]. The projection of a concept class F on X^[n] is F^[n] = {f^[n] | f ∈ F}. ◁

Definition. Let F be a concept class on domain X. F is of polynomial VC-dimension if D_VC(F^[n]) is bounded from above by some polynomial in n. ◁

The following fundamental result, due to Blumer, Ehrenfeucht, Haussler and Warmuth [BEHW], states the relation between polynomial-sample PAC learnability and the VC-dimension. For a proof we refer to [Nat].

Theorem. Let F be a concept class on domain X. Then F is polynomial-sample PAC learnable iff F is of polynomial VC-dimension.

In the proof of the theorem, the following important lemma (from [Nat]) is proved.

Lemma. Let F be a concept class on a finite domain X. If d = D_VC(F), then

  2^d ≤ |F| ≤ (|X| + 1)^d.

Let us use this to obtain bounds on |F^[n]|. Suppose the length parameter n is given, and the domain X is built from an alphabet Σ which contains s characters. Since we are only concerned with elements of the domain of length at most n, and the number of strings of length i over Σ is s^i, we can upper-bound |X^[n]| as follows (assuming s ≥ 2):

  |X^[n]| ≤ s + s^2 + ... + s^n < s^{n+1}.

Putting d = D_VC(F^[n]) and substituting our upper bound on |X^[n]| in the relation of the lemma, we obtain the following relation:

  2^d ≤ |F^[n]| ≤ (|X^[n]| + 1)^d ≤ s^{(n+1)d}.

If we take logarithms with base 2 on both sides, using that log a^b = b·log a and log 2 = 1, we obtain:

  d ≤ log |F^[n]| ≤ (n+1)·d·log s.

This implies that a concept class F is of polynomial VC-dimension (i.e., d is bounded by a polynomial in n) iff log |F^[n]| can be bounded by some polynomial p(n), iff |F^[n]| can be bounded by 2^{p(n)}. Thus if we are able to show that |F^[n]| is bounded in this way, we have thereby shown F to be polynomial-sample PAC learnable. And conversely, if |F^[n]| grows faster than 2^{p(n)} for any polynomial p(n), then F is not polynomial-sample PAC learnable.
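As a quick sanity check, the lemma's inequality 2^d ≤ |F| ≤ (|X| + 1)^d can be verified numerically on the small concept class from the earlier example (an illustration only):

```python
from itertools import chain, combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def vc_dimension(F, X):
    d = 0
    for S in powerset(X):
        if len(S) > d and {frozenset(f & set(S)) for f in F} == set(powerset(S)):
            d = len(S)
    return d

F = [set(), {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
X = {1, 2, 3, 4, 5}
d = vc_dimension(F, X)
print(2 ** d <= len(F) <= (len(X) + 1) ** d)   # True: 8 <= 8 <= 216
```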

Time Complexity

In outline, the analysis of time complexity is similar to the analysis of sample complexity: the time complexity of a learning algorithm is a function from its inputs to the maximum number of computational steps the algorithm takes on these inputs. Here we assume that the procedure Example takes at most some fixed constant number of steps. Again, we are mainly interested in the existence of algorithms which have a polynomially-bounded time complexity.

Representations

Unfortunately, things are somewhat more complicated than in the last section: the number of examples that an algorithm needs is unambiguous, but what about the number of computational steps? What counts as a computational step? In order to make this notion precise, we have to turn to some precise model of computation where it is clear what a single step is. Usually Turing machines are used for this. We will not go into details, but will just note here that a Turing machine programmed to learn some concept will often not be able to output the learned concept g itself efficiently, because this concept may be too large or even infinite. Therefore, instead of the concept g itself, the Turing machine will have to output some finite representation of g, which we call a name of g. Abstractly, a representation specifies the relation between concepts and their names.

(See [HU, BJ] for an introduction to Turing machines. Since Turing machines cannot represent arbitrary real numbers, we have to restrict the parameters ε and δ somewhat, for instance by only allowing them to be the inverses of integers.)

Definition. Let F be a concept class and Σ a set of symbols. Σ⁺ denotes the set of all finite non-empty strings over Σ. A representation of F is a function R which assigns to each f ∈ F a set R(f) ⊆ Σ⁺, where we require that for each f ∈ F, R(f) ≠ ∅, and for every distinct f, g ∈ F, R(f) ∩ R(g) = ∅. For each f ∈ F, R(f) is the set of names of f in R.
The length of a name r ∈ R(f) is simply the string length of r, i.e., the number of symbols in r. The size of f in R is the length of the shortest name in R(f), denoted by l_min(f, R). ◁

The requirement that R(f) ≠ ∅ for each f ∈ F means that each concept in F has at least one name, while R(f) ∩ R(g) = ∅ for every distinct f, g means that no two distinct concepts share the same name. Note the difference between the string length of a string x ∈ X and the size of a concept f ∈ F in R: the latter depends on R, the former does not.

The aim of the analysis of time complexity is to be able to bound, by a polynomial function, the number of steps needed for learning; after all, we are interested in efficient learning. However, if a learning algorithm provided us with a name of an approximately correct concept in a polynomial number of steps, but we were not able to decide in polynomial time whether that concept actually contains a given x ∈ X, we would still have a computational problem. Therefore a representation R should be polynomially evaluable: given an x ∈ X and a name r of a concept f, we should be able to find out in polynomial time whether x ∈ f. This is defined as follows.

Definition. Let R be a representation of a concept class F over domain X. We say that R is evaluable if there exists an algorithm which, for any f ∈ F, takes any x ∈ X and any name r ∈ R(f) as input, and decides in a finite number of steps whether x ∈ f. R is polynomially evaluable if there is such an algorithm whose running time is bounded by a polynomial in the lengths of x and r. ◁

In the sequel, whenever we write 'representation', we actually mean a polynomially evaluable representation.
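Here is a tiny example of a polynomially evaluable representation (my own illustration, not taken from [Nat]): the domain consists of bit strings, a concept is the set of strings having 1s at certain positions, and a name is simply a comma-separated list of those positions. The evaluation algorithm runs in time roughly linear in the lengths of the name and the string.

```python
# Domain X: bit strings (e.g. "01101").
# The concept named by r = "2,3" is the set of all strings in X that have
# a 1 at positions 2 and 3 (1-based).

def evaluate(name: str, x: str) -> bool:
    """Decide whether the string x belongs to the concept named by `name`.
    Runs in time polynomial (in fact roughly linear) in len(name) + len(x)."""
    if name == "":                        # empty conjunction: the whole domain
        return True
    positions = [int(p) for p in name.split(",")]
    return all(i <= len(x) and x[i - 1] == "1" for i in positions)

print(evaluate("2,3", "01101"))   # True: positions 2 and 3 carry a 1
print(evaluate("2,3", "01001"))   # False: position 3 carries a 0
```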

Polynomial-Time PAC Learnability

In order to be able to study time complexity, we need to change the definition of a PAC learning algorithm somewhat to incorporate the representation: a PAC algorithm for a concept class F in representation R should output the name of a concept g, rather than g itself.

Now time complexity can be defined as follows, where we introduce a new parameter l that bounds the size of the concepts considered.

Definition. Let L be a learning algorithm for concept class F in representation R. The time complexity of L is a function t, with parameters ε, δ, n and l. It returns the maximum number of computational steps made by L, over all runs of L with inputs ε, δ, n, l, for all f ∈ F such that l_min(f, R) ≤ l, and all P on X^[n]. If no finite maximum exists, we define t(ε, δ, n, l) = ∞. ◁

Definition. A concept class F is called polynomial-time PAC learnable in a representation R if a PAC algorithm exists for F in R which has a time complexity bounded from above by a polynomial in 1/ε, 1/δ, n and l. ◁

Let us suppose we have some concept class F of polynomial VC-dimension. Then we know F is polynomial-sample PAC learnable, so we only need a polynomial number of examples. Now, in order to achieve polynomial-time PAC learnability of F, it is sufficient to have an algorithm that finds, in a polynomial number of steps, a concept that is consistent with the given examples. A concept is consistent if it contains all given positive examples and none of the given negative examples:

Definition. Let g be a concept and S be a set of examples. We say g is consistent with S if x ∈ g for every positive example (x, 1) ∈ S, and x ∉ g for every negative example (x, 0) ∈ S. ◁

An algorithm which returns a name of a concept that is consistent with a set of examples S is called a fitting, since it fits a concept to the given examples. As always, we want a fitting to work efficiently. The running time of the fitting should be bounded by a polynomial in two variables. The first is the length of S, which we define as the sum of the lengths of the various x ∈ X that S contains. The second is the size of the shortest consistent concept. For this we will extend the l_min notation as follows: if S is a set of examples, then l_min(S, R) is the size of the concept f ∈ F with smallest size that is consistent with S. If no such consistent f ∈ F exists, then l_min(S, R) = ∞.

Definition. An algorithm Q is said to be a fitting for a concept class F in representation R if:

1. Q takes as input a set S of examples.

2. If there exists a concept in F that is consistent with S, then Q outputs a name of such a concept.

If Q is a deterministic algorithm such that the number of computational steps of Q is bounded from above by a polynomial in the length of S and l_min(S, R), then Q is called a polynomial-time fitting. ◁

As the next theorem (from [Nat]) shows, the existence of such a fitting is indeed sufficient for the polynomial-time PAC learnability of a concept class of polynomial VC-dimension.

Theorem. Let F be a concept class of polynomial VC-dimension, and R be a representation of F. If there exists a polynomial-time fitting for F in R, then F is polynomial-time PAC learnable in R.

Conversely, it is also possible to give a necessary condition for polynomial-time PAC learnability, in terms of so-called randomized polynomial-time fittings. We will not go into that here (see [Nat]), but just mention that it can be used to establish negative results: if no such fitting for F in R exists, then F is not polynomial-time PAC learnable in R.

Example. Consider an infinite sequence of properties p_1, p_2, ... For concreteness, suppose the first seven properties of this sequence are the following:

  p_1 = is a mammal
  p_2 = is green
  p_3 = is grey
  p_4 = is large
  p_5 = is small
  p_6 = has a trunk
  p_7 = smells awful

Let us identify an animal with the set of its properties. Then we can roughly represent an animal (that is, an individual animal, not a species) by a finite binary string, i.e., a finite sequence of 0s and 1s, where the i-th bit is 1 iff the animal has property p_i. Here we assume that a binary string of length n tells us whether the animal does or does not have the properties p_1, ..., p_n, while it tells us nothing about the further properties p_{n+1}, p_{n+2}, ... Thus, for instance, some particular small, green, awfully smelling, trunkless mammal could be represented by the string 1100101. Note that not every binary string can represent an animal. For instance, 1110000 would be an impossible mammal which is both green and grey at the same time. Similarly, since an animal cannot be large and small at the same time, a binary string cannot have a 1 at both the 4th and the 5th bit.

Let us suppose our domain X is a set of binary strings, each of which represents some particular animal. Then, simplifying matters somewhat, we can identify a species with the set of animals that have the essential characteristics of that species. Thus, for instance, the concept elephant would be the set of all large grey mammals with a trunk: all strings in X that have (possibly among others) properties p_1, p_3, p_4 and p_6.

How could an algorithm learn the target concept elephant? Well, it would receive positive and negative examples for this concept: strings from X, together with a label indicating whether the animal represented by the string is an elephant or not. Hopefully, it would find out after a number of examples that a string is an elephant iff it has (possibly among others) properties p_1, p_3, p_4 and p_6. Consider the following representation: a conjunction of p_i's is a name of the concept consisting of all strings which have (possibly among others) the properties in the conjunction. Then the conjunction p_1 ∧ p_3 ∧ p_4 ∧ p_6 would be a name of the concept elephant, and the learning algorithm could output this conjunction as a name of the concept it has learned.

It turns out that the concept class that consists of concepts representable by a conjunction of properties is polynomial-time PAC learnable (see [Nat]). Thus, if F is a concept class each member of which is a species of animals that can be represented by some finite conjunction of properties, then a polynomially-bounded number of examples and a polynomially-bounded number of steps suffices to learn some target concept (species) approximately correctly. In fact, much more complex concept classes are polynomial-time PAC learnable as well. For an overview of positive and negative results, see [Nat, AB, KV].
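The standard fitting for this conjunction class is easy to sketch (my own rendering of the textbook algorithm, not code from [Nat]): keep exactly those properties that all positive examples share, and then check the resulting conjunction against the negative examples. Its running time is polynomial in the length of the example set.

```python
def fit_conjunction(examples):
    """Fitting for conjunctions of properties over fixed-length bit strings.
    `examples` is a list of (bitstring, label) pairs with label 1 (positive)
    or 0 (negative). Returns the set of property indices (1-based) of a
    consistent conjunction, or None if no conjunction is consistent."""
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    if not positives:
        # Without positive examples, keep the most specific hypothesis:
        # the conjunction of all properties.
        length = len(examples[0][0]) if examples else 0
        hypothesis = set(range(1, length + 1))
    else:
        # Keep exactly those properties that all positive examples share.
        hypothesis = {i + 1 for i in range(len(positives[0]))
                      if all(x[i] == "1" for x in positives)}
    # Check consistency with the negative examples.
    for x in negatives:
        if all(x[i - 1] == "1" for i in hypothesis):
            return None       # a negative example satisfies the conjunction
    return hypothesis

examples = [("1011010", 1),   # a large grey mammal with a trunk: an elephant
            ("1011011", 1),   # another elephant, which also smells awful
            ("1100101", 0)]   # the small green mammal from the example above
print(fit_conjunction(examples))   # {1, 3, 4, 6}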

Some Related Settings

The standard PAC setting of the previous sections may be varied somewhat. In this section we will mention some alternatives.

Polynomial-Time PAC Predictability

In the ordinary PAC setting, a PAC algorithm for a concept class F reads examples from an unknown target concept f from F, and has to construct a concept g, also from F, which is approximately correct. This may lead to a seemingly paradoxical situation: we would expect that learning a superset of F is at least as hard as learning F itself, but this need not be the case in the ordinary PAC setting. Namely, it may be that there is no polynomial-time PAC algorithm for some concept class F in some representation R, while for some larger concept class G ⊇ F there is such a polynomial-time PAC algorithm. The latter algorithm, when given examples for some target concept f ∈ F, always constructs a name of a probably approximately correct concept g ∈ G in polynomial time. Still, F itself may be hard to learn, because the requirement that the output concept should be a member of F may be very hard to meet.

We can take this into account by loosening the requirement on g somewhat, and allow it to be a member of a broader concept class G of which F is a subset. This gives the learning algorithm more freedom, which may facilitate the learning task. Suppose we have a concept class F, a broader concept class G ⊇ F, and a representation R of G (which is of course also a representation of F). Suppose furthermore that there exists a learning algorithm L for F in R which is just like a PAC algorithm for F in R, except that it outputs a name of a concept g such that g ∈ G, but not necessarily g ∈ F. In this case we say that L is a PAC prediction algorithm for F in R in terms of G, and F is PAC predictable in R in terms of G. If, furthermore, the time complexity of algorithm L is bounded by a polynomial in 1/ε, 1/δ, n and l, we say that F is polynomial-time PAC predictable in R in terms of G. If some G exists such that F is polynomial-time PAC predictable in R in terms of G, we will simply say that F is polynomial-time PAC predictable in R.

Clearly, if some concept class F is polynomial-time PAC learnable in some R, it is also polynomial-time PAC predictable in R: simply put G = F. Hence the setting of polynomial-time PAC predictability may be used to establish negative results: if we can prove that some concept class F is not polynomial-time PAC predictable in R in terms of any G, we have thereby also shown that F (as well as any superset of F) is not polynomial-time PAC learnable in R. The converse need not hold: some classes are polynomial-time PAC predictable but not polynomial-time PAC learnable (see [KV] for an example). Hence polynomial-time PAC predictability is strictly weaker than polynomial-time PAC learnability.

Membership Queries

We may facilitate the learning task by allowing a PAC algorithm to make use of various kinds of oracles. An oracle is a device which returns answers to certain questions, which are called queries. For the PAC algorithm that uses an oracle, the oracle is like a black box: you pose a question and get an answer, but do not know how the oracle constructs its answer. Like the Example procedure, oracles are assumed to run in at most some fixed constant number of steps.

The most straightforward kind are the membership queries. Here the oracle takes some x ∈ X as input, and returns 'yes' if x is a member of the target concept, and 'no' if not. Clearly, the oracle somehow has to have knowledge about the domain. Two justifications for assuming an oracle can be given:

1. Induction can be compared with a simplified picture of the work of a scientist. Consider a particle physicist. The physicist may not know the general laws that characterize the objects in his domain of inquiry, but he can obtain knowledge about certain specific instances of those concepts by doing experiments. Posing a membership query to an oracle is similar to doing an experiment in science, which is like posing a question to nature.

2. When learning, a student may have a teacher who can answer questions about whether some object has a certain property or not. It need not be the case here that the student only learns what the teacher already knows: we only assume the teacher has sufficient knowledge of individual objects of the concepts. The teacher may know all about the particular instances of the concepts, and yet be pleasantly surprised by the concept that a smart student comes up with. Translating this analogy to induction, the oracle acts as the teacher, while the learning algorithm is the student.

If a concept class F is polynomial-time PAC learnable in some R by an algorithm which makes membership queries, we will say that F is polynomial-time PAC learnable in R with membership queries. Analogously, we can define PAC predictability with membership queries. Note that if an algorithm makes membership queries, it in a way creates its own examples. Note also that a polynomial-time algorithm can make at most a polynomial number of queries, since each query counts for at least one computational step.

Equivalence queries need a more fancy oracle, which is discussed in the next subsection. For an overview of other kinds of queries, we refer to [Ang].
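As an illustration of how membership queries let a learner create exactly the examples it needs (my own sketch, ignoring for simplicity the constraint that some property combinations describe impossible animals), consider again the conjunction concepts of the elephant example: n membership queries on strings that are all 1s except for a single 0 identify the target conjunction exactly.

```python
def make_membership_oracle(target_properties, n):
    """Membership oracle for the conjunction of the given property indices
    (1-based) over bit strings of length n."""
    def member(x: str) -> bool:
        return all(x[i - 1] == "1" for i in target_properties)
    return member

def learn_conjunction_with_queries(member, n):
    """Identify a conjunction over n properties with n membership queries:
    property i is in the target iff flipping bit i of the all-ones string
    to 0 makes the string fall outside the target concept."""
    learned = set()
    for i in range(1, n + 1):
        probe = "1" * (i - 1) + "0" + "1" * (n - i)
        if not member(probe):
            learned.add(i)
    return learned

oracle = make_membership_oracle({1, 3, 4, 6}, n=7)   # the 'elephant' conjunction
print(learn_conjunction_with_queries(oracle, 7))     # {1, 3, 4, 6}
```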

Identification from Equivalence Queries

While polynomial-time PAC predictability is strictly weaker than polynomial-time PAC learnability, polynomial-time identification from equivalence queries, introduced by Angluin [Angb], is strictly stronger. In this setting we have an oracle which takes a name of a concept g as input, and answers 'yes' if g equals the target concept f, and 'no' otherwise. In case of a 'no', it also returns a randomly chosen counterexample x ∈ f ∆ g. There is no need for the oracle to provide the correct label of the counterexample x, because the algorithm can find this out for itself: if x ∈ g then x ∉ f, and if x ∉ g then x ∈ f. When equivalence queries are available, the requirement that an algorithm outputs a name of an approximately correct concept is replaced by the requirement that the target concept is identified exactly: an algorithm that is allowed to make equivalence queries should output a name of the target concept.

Consider a concept class F and a representation R of F. Let L be an algorithm which uses equivalence queries in order to learn some unknown concept f ∈ F under some unknown probability distribution P, and which takes as input an upper bound l on l_min(f, R) and an upper bound n on the length of the counterexamples from the oracle. If this algorithm always outputs a name of the target concept, we say F is identifiable from equivalence queries in R. If the running time of the algorithm L is bounded by a polynomial in its inputs l and n, then F is polynomial-time identifiable from equivalence queries in R. As in the case of membership queries, an algorithm with a polynomially-bounded running time can make only a polynomially-bounded number of equivalence queries.

It is shown in [Ang] that if a concept class is polynomial-time identifiable from equivalence queries in some R, then it is also polynomial-time PAC learnable in R. The converse does not hold. Thus, while PAC predictability can be used to establish negative results, identification from equivalence queries may be used for positive results: if we can prove that some concept class F is polynomial-time identifiable from equivalence queries, we have thereby also shown that F (as well as any subset of F) is polynomial-time PAC learnable in R.

We may also allow an algorithm to make both equivalence queries and membership queries. Angluin [Angb] calls this combination a minimally adequate teacher: if a teacher wants to teach some target concept to his student, he should be able to answer the student's questions about whether some object is in the target concept (membership queries), and he should be able to judge whether some concept the student comes up with is really the target concept, and give a counterexample if not (equivalence queries). If polynomial-time identification of F from equivalence queries is done by an algorithm which makes use of equivalence queries as well as membership queries, then we say F is polynomial-time identifiable from equivalence and membership queries in R. This implies polynomial-time PAC learnability with membership queries.
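To see exact identification from equivalence queries at work, here is a small sketch in the spirit of this setting (my own illustration) for the same conjunction class: the learner starts with the most specific conjunction and generalizes it on every counterexample, identifying the target after at most n + 1 queries.

```python
import random

def make_equivalence_oracle(target, n, domain):
    """Equivalence oracle for a target conjunction (set of 1-based indices).
    Given a hypothesis conjunction, answer ('yes', None) or ('no', counterexample)."""
    def satisfies(conj, x):
        return all(x[i - 1] == "1" for i in conj)
    def query(hypothesis):
        diff = [x for x in domain if satisfies(target, x) != satisfies(hypothesis, x)]
        if not diff:
            return "yes", None
        return "no", random.choice(diff)
    return query

def identify_conjunction(query, n):
    hypothesis = set(range(1, n + 1))   # most specific conjunction
    while True:
        answer, x = query(hypothesis)
        if answer == "yes":
            return hypothesis
        # The counterexample is positive for the target but rejected by the
        # hypothesis, so drop every property the counterexample lacks.
        hypothesis = {i for i in hypothesis if x[i - 1] == "1"}

n = 7
domain = [format(k, "07b") for k in range(2 ** n)]   # all 7-bit strings
query = make_equivalence_oracle({1, 3, 4, 6}, n, domain)
print(identify_conjunction(query, n))                # {1, 3, 4, 6}
```

Each 'no' answer removes at least one property from the hypothesis, so the target is reached after at most n counterexamples plus one final 'yes'.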

Learning with Noise

In many real-world learning tasks, examples may contain errors ('noise'), which may for instance be due to inaccurate measurements. There are various ways in which the analysis of noise may be modelled in the theoretical setting for PAC learnability. We will discuss only two kinds of noise here: Valiant's malicious noise [Val], also sometimes called adversarial noise, and Angluin and Laird's random classification noise [AL]. For other kinds of noise, see [Lai, Slo].

Firstly, in the malicious noise model, a malicious adversary of the learning algorithm tinkers with the examples: for each example that the learning algorithm reads, there is a fixed, unknown probability η that the adversary has changed the original correct example into some other pair (x, y) of his choosing. Since y may not be the correct label for x, the adversary may introduce noise in this way. The adversary is assumed to be omnipotent and omniscient; in particular, he has knowledge of the learning algorithm he is trying to deceive. This means that the learning algorithm should be able to cope even with the worst possible changes in the examples.

Secondly, in the random classification noise model, the Example procedure is replaced by a procedure Example_η, and there is a fixed, unknown probability η that the label of an example provided by this procedure is incorrect. For instance, suppose η = 0.1. If a learning algorithm receives an example (x, y) from Example_η, then there is a probability of 10% that y is incorrect.

In both models, the actual noise rate η is unknown to the learning algorithm. However, an upper bound η_b on the noise rate is given as an additional input parameter to a PAC algorithm, where η ≤ η_b < 1/2. This η_b is added as a parameter to the time complexity function as well. If there is a PAC algorithm for a concept class F in some representation R, working in the presence of malicious (resp. random classification) noise, with time complexity bounded by a polynomial in 1/ε, 1/δ, n, l and 1/(1 − 2η_b), then F is said to be polynomial-time PAC learnable in R with malicious (resp. random classification) noise. Similarly, we can define PAC predictability with malicious or random classification noise.
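A random-classification-noise oracle is easy to simulate: the sketch below (an illustration only) wraps a noise-free Example procedure so that each label is flipped independently with probability η.

```python
import random

def with_classification_noise(example, eta):
    """Wrap a noise-free Example() procedure: each label is flipped
    independently with probability eta (0 <= eta < 1/2)."""
    def noisy_example():
        x, y = example()
        if random.random() < eta:
            y = 1 - y
        return x, y
    return noisy_example

# Usage with the threshold oracle sketched earlier (illustrative names):
# noisy = with_classification_noise(oracle, eta=0.1)
# x, y = noisy()
```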

Summary

This thesis is concerned with computational learning theory: the study of algorithmic ways to learn from examples. The dominant formal model of learnability in Artificial Intelligence is Valiant's model of probably approximately correct (PAC) learning. In this model, a concept is simply a subset of a domain X, and a concept class is a set of concepts. A PAC algorithm reads examples for an unknown target concept taken from some concept class, drawn according to an unknown probability distribution, and learns, with tunably high probability, a tunably good approximation of the target concept. A concept class F is polynomial-sample PAC learnable if a PAC algorithm exists for F that uses only a polynomially-bounded number of examples, and is polynomial-time PAC learnable if the algorithm uses only a polynomially-bounded number of steps. In the latter case, the algorithm should output a name of the learned concept in some polynomially evaluable representation. An oracle for membership queries can inform the learner whether a specific object is in the target concept or not. Polynomial-time PAC predictability is weaker than polynomial-time PAC learnability, while polynomial-time identification from equivalence queries is stronger. When noise is involved, the examples may sometimes be incorrect.

Chapter

Application to Language

Learning

Introduction

In the course of the 20th century, language has become the focal interest of philosophy. Many of the central philosophical problems (logical, epistemological, anthropological, ethical and even metaphysical) are bound up with the intricacies of human language. Investigating how human beings learn languages, and which languages can be learned by human beings, may tell us a lot about those intricacies, and should therefore be of great importance to philosophy.

Seeking knowledge about language, where better to turn than to linguistics, the science of language? And within linguistics, whom better to turn to than Noam Chomsky, the man who made linguistics the most scientific of all humanities? One of Chomsky's most interesting claims concerns the innateness of important aspects of our natural languages. For this claim he has been severely attacked by various empiricist philosophers, among them Putnam and Quine. While Chomsky argues that children can only acquire language in virtue of having specific innate language acquisition mechanisms, Putnam and Quine argue that general, not language-specific, learning mechanisms may suffice for language acquisition.

This chapter is an attempt to settle this issue in Chomsky's favour by means of a mathematical argument: we will use results from computational learning theory to establish that children would not be able to learn their native language if they started without any pre-knowledge of the language they have to learn. In other words, we provide a formal proof that general learning mechanisms cannot explain why children acquire language as successfully as they in fact do: language acquisition must be based on certain propensities and biases which direct the child towards certain kinds of languages and away from others. Where do these biases come from? The most plausible answer is that they are innate.

The chapter is organized as follows. We start by sketching some background concerning Chomskyan linguistics and the innateness of language. In order to be able to state formal results about language learnability, we need to provide two things: a formal model of learning, and a formal model of languages and grammars. The first has been dealt with in the previous chapter, while the section on formalizing languages and grammars provides formal counterparts to the notions of a language and a grammar. One of the main self-imposed goals of philosophers is to bring out presuppositions. In the section on simplifying assumptions we follow this laudable practice, making explicit the main assumptions and presuppositions of our analysis of language learnability. The sections on the formal analysis of language learnability form the main part of the chapter. Here we show that the set of languages a child can learn must be severely constrained in order to enable efficient learning to take place. Finally, in the last section we extrapolate this argument to other kinds of learning.

A Brief History of the Chomskyan Revolutions

In this section we will give a brief and incomplete overview of the revolution caused in linguistics by the work of Noam Chomsky. Actually, we may distinguish between two revolutions: the first replaced the behavioristic paradigm by the paradigm of transformational-generative grammar; the second involved important changes in the transformational-generative framework, yielding Chomsky's current principles-and-parameters framework.

(This summary is mainly based on a number of recent linguistic texts [Bot, New, Har, Pin], to which we refer the reader for more detail.)

Against Behaviorism

Linguistics BC (Before Chomsky) was dominated by behaviorism. Accordingly, a good place to start our story is Chomsky's review of Verbal Behavior, a book by the leading behaviorist B. F. Skinner. In his book, Skinner attempted to extend the behaviorist approach to the study of language use by human beings. The behaviorist picture of science amounts to the following: you have some object, for instance an animal, which reacts or responds in certain ways to certain stimuli from the environment. The task of the scientist is to find laws which describe the relations between stimulus and response, on the basis of experiments where you vary the stimulus and observe how the response changes. Previously, that approach had mainly been restricted to very small contexts, for instance rats in mazes, where behaviorist concepts like 'stimulus', 'response' and 'reinforcement' could be precisely defined by reference to simple measuring apparatus.

Chomsky's critique of Skinner's book was simple yet effective: the extrapolation of the behaviorist approach to the area of human language use leaves the key behaviorist concepts empty. The problem for the behaviorist is: what are the stimulus and the response in the case of linguistic behavior? In order to give results, the behaviorist approach requires a precise definition of things like stimulus and response, as well as the ability to somehow measure or determine those things. In a simple laboratory experiment with animals this can indeed be done. However, if we transplant the laboratory terminology to the much more complex case of language learning, it either becomes non-applicable (if we take the terminology literally) or empty (if we take it metaphorically). Skinner's attempt to extend the behaviorist approach to human language use failed, and so have later attempts. In fact, it seems plausible that the precise definitions and ways of measuring that the behaviorist requires are simply unavailable in the complex area of linguistic behavior.

Transformational-Generative Grammar

If language use and learning cannot be described behavioristically, then what? A few years before his Skinner review, Chomsky had himself put forward a radically different linguistic theory. He distinguishes between linguistic performance and competence. Performance is the way a person actually or potentially uses language; competence is what he (perhaps unconsciously) knows about that language. The distinction is crucial: natural languages contain an infinite set of sentences which have never been used before, and hence are not amenable to behavioristic analysis, yet which would easily be recognized as grammatical by any competent native speaker. Because performance varies too much with the contingencies of context, it is not well suited for scientific inquiry; competence is the appropriate target for linguistics. Thus we need a model which specifies the knowledge native speakers have of the set of all syntactically correct sentences, rather than the ones that have actually been used or uttered.

Throughout his career, Chomsky's linguistic research has been motivated by the following problem: what makes it possible that almost all children acquire near-perfect competence of their native language, despite the poverty of the stimulus they receive? When children learn their first language, the only input they receive are the sentences they hear from their parents and others. This set of sentences does not uniquely determine a language: many different languages are consistent with the input the child receives. Nevertheless, it is an empirical fact that children all fill in the gaps in more or less the same way, learning approximately the same language. From this Chomsky concludes that children must be born with a strong linguistic bias, consisting of constraints on the set of possible languages. These constraints (Universal Grammar) lead children to learn only very specific languages from the input they receive, ignoring the infinite number of other languages compatible with the input.

Universal Grammar is incarnated in what Chomsky calls the language faculty, and what others sometimes call the language organ or the language instinct, which is supposed to be a more or less separate module in the mind/brain. (Chomsky often uses the term 'mind/brain' in order to forestall discussions of the dualism-versus-monism type.) In Chomsky's view, the main task of linguistics is to investigate the properties of Universal Grammar. Thus linguistics would give us information about the workings of our mind/brain. In fact, Chomsky has stated at several places that he is mainly interested in linguistics not for its own sake, but because it is a way to gain knowledge about the mind [Har]. This is in sharp contrast to the earlier behaviorist approach, which eschewed anything mental.

(The term 'Universal Grammar' is used with systematic ambiguity, referring both to the initial, innate state of the language faculty at birth, and to the properties shared by all natural languages. For the latter, see [Haw].)


Chomsky's initial, broad model of language competence, first described in his groundbreaking work [Cho] and elaborated in more painstaking detail in [Cho], roughly amounts to the following. Having competence of some language amounts to having (in some sense) a transformational-generative grammar for that language in your head. A transformational-generative grammar for a language determines the set of syntactically correct sentences of that language. It consists of two parts: a base part and a transformational part. The first part generates deep structures of sentences; the second part transforms these into the surface structures that we normally would call sentences. Further operations on deep structures were to yield the interpreted form (or meaning) of a sentence, while further operations on surface structures would yield the phonological form of a sentence. However, in this chapter we will ignore such further linguistic issues, restricting attention to syntax.

The base part is a generative grammar, which consists of a set of phrase structure rules and a lexicon. The phrase structure rules recursively specify the forms a sentence may have. For example, such rules might state: a sentence can consist of a noun phrase followed by a verb phrase, and a noun phrase can consist of a determiner followed by a noun. The lexicon is like a dictionary: it contains the words that may be plugged into those forms. One entry might for instance be 'dog: singular animate noun'. Inserting words into a sentence form yields a deep structure. The base part of a grammar thus generates a set of deep structures. An example of such a deep structure might be the following tree, which gives a structural description of the sentence 'The dog bites the man'.

[Figure: Deep structure of the sentence 'The dog bites the man'. In bracket notation: [Sentence [Noun phrase [Determiner the] [Noun dog]] [Verb phrase [Verb bites] [Noun phrase [Determiner the] [Noun man]]]].]

The transformations that make up the second part of a transformational-generative grammar map deep structures to surface structures. For example, two transformations may take the above deep structure into the following surface structures, respectively:

(1) The dog bites the man (the identity transformation)

(2) The man was bitten by the dog (a passivizing transformation)

The relationship between sentences (1) and (2) is brought out by the way they have been constructed: both stem from the same deep structure.

The base part of a transformational-generative grammar (phrase structure rules plus lexicon) was intended to be fairly restricted in power, hopefully only context-free (see the section on the Chomsky hierarchy for the definition of this term). Unfortunately, the transformational part of a grammar is potentially too expressive: it can be shown that any grammatically describable language can be generated by means of a transformational-generative grammar [PR, BC], even if we choose a very simple, fixed base part. This means that the transformational-generative framework by itself (including the choice of the base part) does not seem to make any substantive claims about Universal Grammar. Accordingly, claims about Universal Grammar had to be phrased in terms of constraints on the allowed base part and, particularly, transformations. Eventually, the problems in formulating adequate constraints led up to a second revolution in linguistics.

Principles-and-Parameters

Chomsky's current principles-and-parameters model does away with phrase structure rules altogether, and with most of the transformations as well [Cho, Cho]. Instead, much more emphasis is placed on the information in the lexicon than in the earlier model. Sentences are generated directly from the lexicon, by means of the interaction of a number of subsystems, each consisting of various general principles. In this new model, the principles are innate (and hence the same for all natural languages), except for some parameterized variation. Thus innate Universal Grammar specifies schemes of principles which have certain open parameters, and it specifies the range of possible values those parameters may take. Fixing the parameters in the principle schemes yields the principles that govern particular natural languages. Therefore, apart from the lexicon, each natural language can be characterized by the particular values of the parameters for that language. Accordingly, for a child, learning a language now amounts to two things: (1) determining the particular values of the parameters that yield the principles of its native language, and (2) acquiring the lexicon.

The Innateness of Universal Grammar

The above pages briefly mentioned the philosophically most interesting of Chomsky's claims: the innateness of Universal Grammar. Innateness has of course been a topic for philosophical discussion for years. Particularly in the 17th and 18th century the debate was on. The rationalists Descartes and Leibniz are commonly considered to be on the innate side, the empiricists Locke and Hume on the other. Thus, for instance, Descartes (as cited in [Cho]) takes the ideas of figures, pain, colour and sound to be innate, while Book I of Locke's Essay [Loc] argues against all innate notions and principles. As far as the Anglo-American philosophical world is concerned, it seems fair to say that the empiricist side of the discussion has been dominant. It has long been the guiding principle of empiricist philosophy that most (possibly all) knowledge derives from the senses, a principle which would be seriously undermined if important aspects of knowledge of language turned out to be innate.

(It should be footnoted that nowadays Chomsky's work is much less dominant than it used to be. In present-day linguistics, Chomsky's approach is one among several alternative approaches. Others are based on, for instance, neural networks or various kinds of constraints; see [Sei, PS] and the references therein.)

Chomsky's work has revived the old debate on innateness in an updated guise: the discussion has shifted from innate ideas to innate mechanisms. As far as language learning is concerned, Chomsky explicitly sides with the rationalists:

    In general, then, it seems to me correct to say that empiricist theories about language acquisition are refutable wherever they are clear, and that further empiricist speculations have been quite empty and uninformative. On the other hand, the rationalist approach exemplified by recent work in the theory of transformational grammar seems to have proved fairly productive, to be fully in accord with what is known about language, and to offer at least some hope of providing a hypothesis about the intrinsic structure of a language-acquisition system that will meet the condition of adequacy-in-principle and do so in a sufficiently narrow and interesting way so that the question of feasibility can, for the first time, be seriously raised. [Cho]

Thus it is not surprising to find that he has been attacked by various contemporary philosophers of a more empiricist bent, such as Putnam and Quine. Unfortunately, the discussion has been hampered by mutual misunderstandings (see for instance [Cho, Qui]). Both positions are not as extreme as their respective opponents sometimes take them to be. As Rosemont [Ros] notes, the Chomskyans certainly acknowledge that a child develops language from experiential data, though they would perhaps prefer to say that language grows in a child, triggered by experience, rather than that it is being learned from experience. On the other hand, most present-day empiricists have watered down their empiricism somewhat, and would agree that we are not born as a tabula rasa, but have various innate capacities and biases. An example of the latter would be a general innate measure of similarity used for induction [Qui].

(Putnam certainly was an empiricist at the time he wrote his critique of Chomsky [Put, Put], but seems harder to classify now. Quine is still as empiricist as he ever was. However, it is interesting to note that of the fifteen replies that Quine wrote in [DH], the one to Chomsky is the longest and most detailed. This suggests that Quine took Chomsky's criticisms more seriously than those made by other philosophers.)

Perhaps the best way to state the debate is as follows. While Quine and other empiricists are willing to admit innate biases and learning mechanisms, these are general-purpose mechanisms. Chomsky, on the other hand, postulates innate mechanisms which are specific for language. In other words, while empiricists still hang on to the idea that children acquire language by means of the same general-purpose learning mechanisms that enable them to learn, e.g., to recognize faces or to wash their hands after dinner, Chomsky has dispensed with this idea.

In the remainder of this chapter we will see that results from computational learning theory indicate that Chomsky is in the right here. In particular, some pre-knowledge of the language that is to be learned is required in order to enable efficient language acquisition, and this pre-knowledge is most likely innate. It is perhaps surprising that this conclusion can be established by means of a mathematical argument. To be sure, Chomsky himself writes the following about the innateness of particular linguistic mechanisms:

    You can't demonstratively prove it is innate; that is because we are dealing with science and not mathematics; even if you look at the genes, you couldn't prove that. In science you don't have demonstrative inferences; in science you can accumulate evidence that makes certain hypotheses seem reasonable, and that is all you can do; otherwise you are doing mathematics. [Cho, as cited in Bot]

Indeed, we cannot formally prove that certain specific grammatical principles must be innate. Still, what we can prove (given the presuppositions outlined in the section on simplifying assumptions) is that some bias must be present in order to make language learning feasible, and this bias is probably innate, since there is no plausible alternative source where it might have come from.

Formalizing Languages and Grammars

Since the aim of this chapter is to give a formal proof of the necessity of innate grammatical biases in language learning, we need to formalize the notions of language and grammar. This is the job of the next three subsections.

Formal Languages

Let us consider some fixed alphabet Σ, for instance the set consisting of the letters 'a' to 'z' (in lower as well as upper case), the digits 0-9, some interpunction symbols, and the blank space. For any set of symbols S, we use S* to denote the set of all finite strings (concatenations) of symbols from S, and S+ to denote the set of all finite non-empty strings. A sentence will simply be a member of Σ+, i.e., a finite non-empty string of symbols from Σ.

We can only speak of grammatically correct or incorrect sentences relative to a language. We will take a language to be simply a (possibly infinite) set L of sentences: any subset of Σ+ is a language. A sentence s is grammatically correct for L if s ∈ L, and grammatically incorrect otherwise. Thus, for instance, the English language is simply the set E of all English sentences. The sentence 'The dog ate my homework' would be grammatically correct in English (it would be a member of E), while the sentences 'dog the blab' and 'De hond heeft mijn huiswerk opgegeten' would not. Of course, for natural languages the boundaries between grammatically correct and incorrect sentences are not that sharp. In fact, Chomsky [Cho] has suggested that grammaticality may be a matter of degree, though he has not added much subsequent flesh to this suggestion. In this chapter we will assume grammaticality to be a sharp boundary.

(This is similar to making grammatical correctness relative to a grammar: given a grammar G for a language L, a sentence s is grammatically correct iff G generates s.)

Formal Grammars

A common way to specify a language is by giving its grammar. We will define a grammar as a set of rules called productions. Using productions, a grammar generates sentences, starting from an initial symbol S (for 'sentence'). Let N be a finite set of symbols called nonterminals; N should at least contain the symbol S. The symbols in the alphabet Σ are called terminals, and we will assume Σ and N to be disjoint. In order to distinguish nonterminals typographically from terminals, we will write down nonterminals in a bold typeface. A production is something of the form A → B, where A and B are finite strings over N ∪ Σ, and A contains at least one nonterminal. A will be referred to as the left-hand side of the production, and B as the right-hand side. A grammar G is a finite set of productions, such that at least one of the productions in G has S as left-hand side.

How does a grammar relate to a language? The productions in a grammar G function as rewriting rules, which allow you to replace, in some string, the left-hand side of a production by the right-hand side of that production. G can generate a sentence (a string of terminals) as follows:

1. Start with the string s = S.
2. Repeat the following, until s is a sentence (i.e., contains only terminals):
   - Find a production A → B in G such that A occurs somewhere in the string s.
   - Apply the production: replace one occurrence of A in s by B.

The language generated by a grammar G, denoted by L(G), is the set of all sentences which can be generated in this way.

An example will make this clearer. Consider a set N consisting of the three nonterminals S (for 'Sentence'), N (for 'Name') and V (for 'Verb phrase'). Let G be the grammar consisting of the following five productions:

1. S → N V
2. N → John
3. N → Paul
4. V → hates N
5. V → thinks that S

We do not make a formal distinction between the phrase structure rules of a grammar and its lexicon: both are represented by means of productions. The first production states that a sentence consists of a name followed by a verb phrase. The names can be either 'John' or 'Paul'. The last two productions show how the verb phrase V can be expanded. Note that the second of these two productions introduces the nonterminal S again. This grammar can for instance generate the sentence 'John hates Paul' as follows:

    S =1=> N V =2=> John V =4=> John hates N =3=> John hates Paul

We start with S and apply productions until we end up with a string without nonterminals. The application of production i is denoted above by =i=>. Such a sequence of applications of productions is called a derivation of the string from the grammar. Note that a derivation corresponds to the kind of structural description that is embodied in the tree of the figure above.

Similarly, G can generate the sentences 'John hates John', 'Paul hates John' and 'Paul hates Paul'. We can also use G to generate the more complex sentence 'Paul thinks that John hates Paul', the derivation of which is:

    S =1=> N V =3=> Paul V =5=> Paul thinks that S =1=> Paul thinks that N V =4=> Paul thinks that N hates N =2=> Paul thinks that John hates N =3=> Paul thinks that John hates Paul

L(G), the language generated by G, is the set of all sentences which can be generated in this way. Note that L(G) is infinite: due to the reintroduction of S in the fifth production, the language contains the sentences 'John thinks that Paul hates John', 'John thinks that Paul thinks that John hates Paul', etc. It should be clear that the language generated in this way is only a tiny subset of a natural language (e.g. English). On the other hand, the example shows that small (and yet infinite) fragments of language can already be generated with very simple grammars. This holds out the hope that larger parts of natural languages can be generated by larger grammars, and perhaps it is even possible to give a complete grammar for a natural language.

(A grammar is often defined more formally as a tuple G = (N, Σ, P, S), where N is the set of nonterminals, Σ is the set of terminals, P is the set of productions, and S is the starting symbol. We have simplified this to G = P, because our starting symbol will always be S, and the sets N and Σ can be read off from the set of productions.)
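The generation procedure is easy to simulate. The sketch below (illustrative Python, not part of the thesis) encodes the five productions of the John/Paul grammar and repeatedly rewrites the leftmost nonterminal until only terminals remain; since production 5 reintroduces S, a step bound is used to keep each derivation finite.

    import random

    # The John/Paul grammar: keys are nonterminals, values list the
    # possible right-hand sides of their productions.
    PRODUCTIONS = {
        "S": [["N", "V"]],
        "N": [["John"], ["Paul"]],
        "V": [["hates", "N"], ["thinks", "that", "S"]],
    }

    def generate(max_steps=50):
        """Generate one sentence of L(G) by leftmost rewriting.
        Restarts if the (recursive) derivation gets too long."""
        while True:
            string, steps = ["S"], 0
            while any(sym in PRODUCTIONS for sym in string):
                steps += 1
                if steps > max_steps:
                    break                      # derivation too long, restart
                i = next(i for i, s in enumerate(string) if s in PRODUCTIONS)
                string[i:i + 1] = random.choice(PRODUCTIONS[string[i]])
            else:
                return " ".join(string)

    for _ in range(3):
        print(generate())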

The Chomsky Hierarchy

Clearly, some languages are more complex than others. The complexity of a language is related to the complexity of the simplest grammar which generates that language. Below we define the Chomsky hierarchy, consisting of the classes of Type 3, Type 2, Type 1 and Type 0 languages, with increasing grammatical complexity.

- A Type 3 (or regular) grammar contains only productions in which the left-hand side is a nonterminal and the right-hand side is either a terminal or the concatenation of a terminal and a nonterminal.

- A Type 2 (or context-free) grammar contains only productions in which the left-hand side is a nonterminal, while the right-hand side is an arbitrary non-empty string of terminals and nonterminals.

- A Type 1 (or context-sensitive) grammar contains only productions in which the right-hand side is at least as long as the left-hand side.

- A Type 0 grammar may contain any kind of productions.

A language is of Type i if it can be generated by a grammar of Type i. For instance, the John-Paul grammar is a Type 2 (context-free) grammar, and hence generates a Type 2 (context-free) language. Of what Type are full natural languages? This is a question to which we will return later.
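Since the hierarchy is defined purely by the shape of the productions, the type of a given grammar can be determined by a simple syntactic check. The following sketch (illustrative Python, not from the thesis; symbols written entirely in upper case stand for nonterminals) returns the most restrictive type whose constraints all productions satisfy.

    def is_nonterminal(sym):
        return sym.isupper()

    def grammar_type(productions):
        """productions: pairs (lhs, rhs), each a tuple of symbols.
        Returns the most restrictive Chomsky type of the grammar."""
        t3 = t2 = t1 = True
        for lhs, rhs in productions:
            # Type 3: lhs is one nonterminal; rhs is 'a' or 'a B'.
            t3 &= (len(lhs) == 1 and is_nonterminal(lhs[0]) and
                   (len(rhs) == 1 and not is_nonterminal(rhs[0]) or
                    len(rhs) == 2 and not is_nonterminal(rhs[0])
                                   and is_nonterminal(rhs[1])))
            # Type 2: lhs is one nonterminal; rhs is any non-empty string.
            t2 &= len(lhs) == 1 and is_nonterminal(lhs[0]) and len(rhs) >= 1
            # Type 1: rhs at least as long as lhs.
            t1 &= len(rhs) >= len(lhs)
        return 3 if t3 else 2 if t2 else 1 if t1 else 0

    # The John/Paul grammar is context-free (Type 2) but not regular:
    john_paul = [(("S",), ("N", "V")),
                 (("N",), ("John",)), (("N",), ("Paul",)),
                 (("V",), ("hates", "N")),
                 (("V",), ("thinks", "that", "S"))]
    print(grammar_type(john_paul))   # prints 2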

We will now informally state some important results from the theory of formal languages; for technical details and proofs, see [HU]. Firstly, it can be shown that the class of Type 3 languages is a proper subset of the class of Type 2 languages. Similarly, Type 2 is a proper subset of Type 1, and Type 1 is a proper subset of Type 0. Type 1 languages are recursive: there exists an algorithm for deciding whether a given sentence is a member of the language generated by a given Type 1 grammar. Type 0 languages are recursively enumerable: there exists an algorithm which enumerates the (possibly infinite) set of sentences in the language generated by a given Type 0 grammar. Membership of a sentence in a language is semi-decidable, but not always decidable, for a given Type 0 language. Type 0 grammars are equivalent to Turing machines, in the sense that a language is of Type 0 iff it is accepted by some Turing machine. Finally, even though the class of Type 0 languages is the broadest class in the hierarchy, it still does not comprise all possible languages: some languages are not Type 0 languages.

(A context-sensitive grammar may equivalently be defined as a set of productions of the form αAβ → αBβ. Here A → B is simply a context-free production, which however may only be applied as a rewriting rule in case A is surrounded by α on the left and β on the right. That is, the pair (α, β) specifies the context in which the rule may be applied; α and/or β may be empty.)

(A very quick proof of the last claim: the set Σ+ of all sentences is denumerably infinite; the set of all languages is the power set of Σ+, i.e., the set of all sets of sentences, and is therefore uncountable; the set of all grammars is only denumerably infinite; thus there are more languages than there are grammars, which implies that some languages cannot be generated by any grammar, and hence are not Type 0 languages.)

Simplifying Assumptions for the Formal Analysis

In the next sections we will give a formal analysis of language learning from example sentences, in the PAC setting. The main objective there is to show that unbiased language learning is just too hard, and hence cannot be what children actually do. From this it would follow that children must have some biases, which influence the language they learn and the way they learn it.

It will be clear to all but the most naive readers that the real world is simply too big to model completely. Inevitably, a formal analysis involves a number of simplifications, which have the disadvantage of making the analysis less realistic and correspondingly less plausible. On the other hand, simplifying away a number of the contingencies and noise of the real world may bring out more clearly what really matters for language learning. Let us state and defend, right at the outset, the main simplifying assumptions we will make here.

Language Learning Is Algorithmic Grammar Learning

A first assumption is that the process by which a child learns a language can be described by an algorithm. This algorithm takes sentences from some target language (i.e., what is to become the child's native language) as input, and learns a grammar for those sentences. The sources of those input sentences may be very diverse: parental speech, dialogue from television series, and so on. Actually, two distinct assumptions are at work here: (1) that language learning is algorithmic, and (2) that it involves learning a grammar. We will clarify these two assumptions separately.

Firstly, the algorithmic aspect. To say that a child's learning can be described by an algorithm does not imply that children consciously enact the steps of some algorithm when they are learning. It does, however, mean that there is an algorithm which, whenever it is given the same input as the child, learns the same grammar. In other words, there should be an algorithm whose input-output behaviour is equivalent (or, if you will, isomorphic) to the child's.

With this assumption we are squarely within the tradition of cognitive science. Here all intelligent behaviour, which includes learning, is taken to be describable as some form of algorithmic information processing. A very fundamental theoretical justification for this may be found in the Church-Turing thesis. Informally, this thesis says that everything that can be accomplished by certain systematic means (whatever these may be) can also be accomplished by an algorithm, as implemented in a Turing machine (see [Hof] for more on this). If the Church-Turing thesis holds, then it seems that the process of language acquisition should indeed be describable by an algorithm.

Secondly, what does it mean to say that children learn a grammar? It is a well-known fact that people are usually not able to state explicitly the grammar of their native language: they follow the rules of that grammar without consciously knowing those rules. That is, people may have 'know how' knowledge of language (i.e., they know how to use language) without having 'know that' knowledge. Thus we have to be a bit careful when we say that children acquire a grammar. In the sequel, we will say that a child has learned a grammar G if the child by and large consistently follows that grammar: it only utters (or writes) sentences from L(G). In other words, we will say that a child has learned a grammar if that grammar appropriately describes the child's linguistic behaviour, even though the child itself may be unaware of the rules of the grammar it follows. In this way, a grammar provides an appropriate description of 'know how' knowledge of language.

(Note that it is no easy matter to find out whether a child's linguistic behaviour is in accordance with some grammar G. Finding this out will usually be a matter of induction itself: if we have observed a child's linguistic utterances for quite a while, and all utterances belong to the language generated by G, we may tentatively assume that the child has acquired the grammar G; [Cho] has more on this.)


One further caveat has to be entered. Namely, the way we have formalized grammars (as finite sets of productions) in the previous sections is rather different from either Chomsky's transformational-generative grammar or his later principles-and-parameters model. Actually, a set of productions formalizes only the base part of a transformational-generative grammar, ignoring the transformations. Furthermore, we only focus on the set of sentences that a grammar generates, mainly ignoring the way the sentence is parsed by the grammar; that is, the only part of trees like the one in the figure above that we are interested in is the sentence constituted by the words at the leaves of the tree. However, any language with a transformational-generative or a principles-with-fixed-parameters grammar is a Type 0 language, and hence representable by a finite set of productions. Accordingly, restricting attention to the learnability of sets of productions does not really invalidate our analysis.

This caveat also bars invoking semantical considerations to argue against the present, purely syntax-oriented analysis. Though a full grammar would probably let semantical issues influence the syntax, the resulting system would still be equivalent to some Turing machine (assuming the Church-Turing thesis), and hence could be redescribed in purely syntactical terms, as a Type 0 grammar. Again, restricting attention to the learnability of syntax (i.e., sets of productions) does not invalidate our analysis.

All Children Have the Same Learning Algorithm

Furthermore, we will assume that all children can be described by the same learning algorithm. This is certainly a false presupposition: no two children are the same, so no doubt some children will process sentences in a different way, and will learn a different grammar from the same input. Nevertheless, since the brains of children all over the world have roughly the same structure, it does seem fair to say that they probably have approximately the same mechanisms for acquiring language. Therefore we take this presupposition to be at least approximately true.

No Noise in the Input

In addition, we will assume all input sentences that the child receives to be grammatically correct: all input does indeed conform to one single grammar for the target language. Again, this is an obviously false assumption. For instance, parents of very young children are notorious for the ungrammatical 'googoo-gaagaa'-like way they talk to their infant. This is not a problem for us, however, since our objective here is to show that unbiased language learning is computationally intractable. Since learning with noise is at least as hard as learning without noise, it will be sufficient for us to show that noiseless unbiased learning is already too hard.


PAC Learnability Is the Right Analysis

The final assumption is that polynomial-time PAC learnability (or, somewhat more liberally, polynomial-time PAC predictability) provides an appropriate analysis of learnability. That is, we will take it that a concept class is learnable for a child if and only if the child has an efficient (polynomial-time) PAC learning algorithm for that class. For language learning, this means that a class of possible languages is learnable if and only if the child's language acquisition mechanism is a polynomial-time PAC learning algorithm for that class.

Some readers may feel this to be too strong a requirement. After all, PAC learnability is a worst-case analysis, over all possible probability distributions on the domain. Do we really require that a child be able to learn a language with all possible distributions over the examples (some distributions are pretty weird)? Maybe not. Maybe the child's language acquisition algorithm only works for certain probability distributions, for instance those under which the most often used sentences are also more probable to appear as input. But that would mean that the child's learning algorithm is already biased towards particular probability distributions, and that it would not be able to learn its native language under some other distributions. Either way, whether PAC learnability is the right analysis or not, we have established the main objective of this chapter: a child must have a certain bias which directs the way it acquires language.

Another feature of PAC learnability which may seem too strong is the use of the confidence parameter δ and the error parameter ε. Can we really set δ and ε to arbitrarily small values, and be sure that a child will, with probability at least 1 − δ, learn a language (actually, a grammar for that language) which has error less than ε compared to the target language? Again, maybe not. Maybe this is indeed too much to ask. Nevertheless, it seems fair to assume that giving a child more example sentences, as well as more time to think those sentences over, will increase the probability that it learns an approximately correct language, and will decrease the number of errors the child makes. From this I conclude that the requirements of PAC learnability are at least right in spirit, even though the technical details of those requirements might be somewhat too strong.

Formal Analysis of Language Learnability

The PAC Setting for Language Learning

We will now tune the PAC setting of the previous chapter to language learning. Let us take as our domain X all possible sentences, i.e., all finite non-empty sequences using symbols from some fixed alphabet Σ. A language is then a concept over X, and a concept class (or language class) is a set of languages. Note that grammars can be used to represent languages, in the sense of the previous chapter, as follows. A grammar G represents (or is a name of) a language L if L equals the set of sentences generated by G: the grammar G is a name of the language L(G). We will call this representation the grammatical representation. Note that, since a language can usually be generated by more than one grammar, most languages will have more than one name in this representation.


A Conjecture

In the section on simplifying assumptions we have made the assumption that all children have the same algorithm for learning their native language from input sentences. Let us call this algorithm C (for 'Child'). No doubt C is extremely complex, and no one knows exactly what it looks like. However, this need not detain us here: the abstract assumption of the existence of this algorithm is sufficient for our purposes.

The algorithm C is an algorithm for learning languages from example sentences. Since children all over the world are able to learn their native language, C must be an algorithm that can learn at least all existing natural languages. Moreover, it is well known that most children, when provided with sufficiently many example sentences of what will become their native language, learn that language almost perfectly. Thus, when a child is presented with example sentences from some natural language, it will probably learn that language approximately correctly. I will take this as strong evidence for the conjecture that C is a PAC learning algorithm for the class of all existing natural languages. Furthermore, children learn their language quite fast and without much visible effort, usually within only a few years (children certainly outperform present-day language-processing computers when it comes to language learning). This is particularly fast when compared to the time and effort humans generally need to acquire competence in other complex areas, for instance learning mathematics or learning how to play the piano to the same level of perfection. Thus it appears that children not only learn language probably approximately correctly, but that they do so quite efficiently as well. Therefore we will strengthen our conjecture by assuming that C is an efficient (i.e., polynomial-time) PAC learning algorithm for the class of all existing natural languages. Finally, there is no need to assume that the existing natural languages are the only languages that our algorithm C can learn efficiently: any language sufficiently similar to the existing natural languages will be learnable by children as well. Accordingly, we will make the following claim:

    There exists a language class L, containing (probably among others) all existing natural languages, such that C is a polynomial-time PAC algorithm for L.

Our aim in the following sections is to find constraints on L, using arguments from computational learning theory. Specifically, it will be shown that L cannot be the set of all context-sensitive languages.

(Steven Pinker's [Pin] contains a very elaborate and ambitious proposal as to the actual learning mechanisms used. Also note that specifying what 'sufficiently similar to the existing natural languages' means is more or less the same as specifying Universal Grammar.)

The Learnability of Type 4 Languages

In this and the following sections we will investigate the learnability of the different levels in the Chomsky hierarchy. Actually, it will turn out that, without additional help (e.g. the ability to make membership queries), even the simplest class in the Chomsky hierarchy, the class of Type 3 languages, is not efficiently learnable.

In fact, we can define an even simpler type, on top of the Chomsky hierarchy, which is still not efficiently learnable. Recall that a language is of Type 3 (or regular) if it can be generated by a grammar containing only productions of the following forms:

    A → a
    A → aB

Such a grammar may be circular or recursive, in the sense that, for instance, it contains productions A → aB, B → bC and C → cA. We can restrict the regular languages by banning such recursion. Formally, this is defined as follows.

Definition  Let G be a Type 3 (regular) grammar and N be the set of nonterminals in G. We say there is a chain in G from A ∈ N to B ∈ N if one of the following holds:

1. G contains a production of the form A → aB.
2. There exists a chain from A to some C ∈ N, and G contains a production of the form C → aB.

Definition  A Type 3 grammar G with set of nonterminals N is of Type 4, or non-recursive, if there does not exist a chain in G from any A ∈ N to A. A language is of Type 4 if it can be generated by a Type 4 grammar.
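In graph-theoretic terms, a chain from A to B exists exactly when B is reachable from A along edges contributed by productions of the form A → aB, so a regular grammar is non-recursive precisely when this reachability relation has no cycles. A minimal sketch of this check (illustrative Python, not from the thesis):

    def is_nonrecursive(productions):
        """productions: pairs (A, rhs) with A a nonterminal and rhs
        either (a,) or (a, B).  Returns True iff no nonterminal has a
        chain to itself, i.e. the grammar is of Type 4."""
        # Build the 'chain' graph: edge A -> B for every production A -> aB.
        edges = {}
        for lhs, rhs in productions:
            if len(rhs) == 2:
                edges.setdefault(lhs, set()).add(rhs[1])

        def reachable(start):
            seen, stack = set(), [start]
            while stack:
                node = stack.pop()
                for nxt in edges.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            return seen

        return all(a not in reachable(a) for a in edges)

    # {A -> a, A -> aA} is regular but recursive, hence not non-recursive:
    print(is_nonrecursive([("A", ("a",)), ("A", ("a", "A"))]))   # False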

As the next theorem shows, the class of Type 4 languages is fairly simple indeed.

Theorem  A language L is of Type 4 iff L is finite.

Proof
(⇒) Suppose L is generated by a Type 4 grammar G. Note that if there is a derivation of a sentence s = a_1 a_2 ... a_k from G, then G contains productions S → a_1 A_1, A_1 → a_2 A_2, ..., A_{k-2} → a_{k-1} A_{k-1}, A_{k-1} → a_k. Then there is a chain from S to any A_i, and a chain from A_i to A_j whenever i < j. Now assume L is infinite. Then there is a sentence s ∈ L such that s cannot be generated without using some production A → aB more than once. But then there would be a chain from A to A, contradicting the assumption that G is of Type 4.
(⇐) Suppose L = {s_1, ..., s_k} is finite. It is easy to see that L can be generated by a Type 4 grammar, using a separate set of nonterminals for each s_i.  □

The reader may wonder why we have not used a more general definition of non-recursive languages. After all, a similar definition of 'chain' and 'non-recursive' may be given for grammars of arbitrary type. However, it can in fact easily be shown that if we generalize the definition, it still holds that any non-recursive grammar (even one of Type 0) can generate only a finite, and hence Type 4, language. Thus it is no real restriction to define non-recursiveness for Type 3 languages only.

Alternatively, we might have defined Type 4 languages more generally, by limiting the number of recursive applications of productions to some number k, instead of banning recursion altogether. That is, we might have defined a k-recursive language as a language generated by a Type 3 grammar, under the constraint that for any A ∈ N, a derivation of a sentence uses at most k productions that have A as left-hand side. Note that this would not be a restriction on the grammar, but on the way the grammar is used to generate sentences. But again, this is no real restriction, for it can easily be shown that such a k-recursive language will be finite, and hence already in the class of Type 4 languages as formally defined above.

Since some regular grammars generate infinite languages, for instance G = {A → a, A → aA}, it follows that the class of non-recursive languages is a proper subset of the class of regular languages.

Unfortunately, even the very simple class of all non-recursive languages is not efficiently learnable.

Lemma  The class of non-recursive languages is not of polynomial VC-dimension.

Proof  Let F be the concept class of all non-recursive languages over some fixed alphabet Σ which contains s characters. Then the number of sentences of length i is s^i, and the number of sentences of length at most n is

    s + s^2 + ... + s^n.

F_n is the set of all finite sets of such sentences, so

    |F_n| = 2^(s + s^2 + ... + s^n).

This implies that log_2 |F_n| cannot be upper-bounded by a polynomial in n. Hence, by the remarks following the lemma in the last chapter, we have that F is not of polynomial VC-dimension.  □
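As a quick numerical illustration of the counting argument in the proof, the following sketch (illustrative Python, not from the thesis, assuming an alphabet of s = 2 symbols) prints how fast log_2 |F_n| grows with n:

    # Number of sentences of length at most n over an alphabet of s symbols;
    # this number equals log2 |F_n|, since F_n consists of all their subsets.
    s = 2
    for n in range(1, 7):
        num_sentences = sum(s ** i for i in range(1, n + 1))   # s + s^2 + ... + s^n
        print(f"n={n}: sentences up to length n = {num_sentences}, "
              f"log2 |F_n| = {num_sentences}")

The printed values grow exponentially in n, so log_2 |F_n| is not bounded by any polynomial in n, which is exactly what the lemma needs.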

Thus, using the theorem from the previous chapter that links polynomial VC-dimension to polynomial sample complexity:

Theorem  The class of non-recursive languages is not polynomial-sample PAC learnable.

Learning languages seems to be very hard indeed. Even the class of all finite languages is not efficiently learnable: there are simply too many such languages. In the next section we will see how membership queries may help to solve this problem.


The Learnability of Type 3 Languages

In this section we will investigate the learnability of the class of regular (Type 3) languages. Firstly, since the class of Type 4 languages is a proper subset of the class of Type 3 languages, the negative result of the previous section carries over immediately to Type 3 languages.

Corollary  The class of regular languages is not polynomial-sample PAC learnable.

Furthermore, in [KV] it is shown that the class of languages which can be recognized by a deterministic finite automaton (DFA) is not efficiently PAC predictable in any polynomially evaluable representation, under a common complexity-theoretic assumption (for which see [KV]). The details of such DFAs need not detain us here. What is important is that a language is regular if and only if it can be recognized by such a DFA [HU]. Hence it follows that the class of regular languages is not efficiently PAC predictable, if membership queries are not available. Since this result holds for any polynomially evaluable representation, it holds in particular when we use grammars to represent languages.

Theorem  The class of regular languages is not polynomial-time PAC predictable in the grammatical representation.

However, in the stronger setting where both equivalence and membership queries are available, the class of languages representable by DFAs is efficiently exactly learnable [KV]. This result is due to Angluin [Angb]. Since learning with equivalence queries implies PAC learning (see the previous chapter), and a DFA can easily be converted into a regular grammar, we have the following result.

Theorem  The class of regular languages is polynomial-time PAC learnable with membership queries in the grammatical representation.
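The conversion from a DFA to a regular grammar mentioned in this argument is mechanical: every transition of the automaton becomes a production of the form A → aB, and transitions into accepting states additionally yield terminating productions A → a. A minimal sketch (illustrative Python; the example automaton and the naming scheme are assumptions, not taken from the thesis):

    def dfa_to_regular_grammar(transitions, accepting, start):
        """transitions: dict (state, symbol) -> state.  Returns the start
        nonterminal and a list of Type 3 productions; the nonterminal for
        state q is 'Q_q'.  (The empty string, if accepted, is ignored.)"""
        prods = []
        for (q, a), q2 in transitions.items():
            prods.append((f"Q_{q}", (a, f"Q_{q2}")))     # Q_q -> a Q_q2
            if q2 in accepting:
                prods.append((f"Q_{q}", (a,)))           # Q_q -> a
        return f"Q_{start}", prods

    # DFA over {a, b} accepting non-empty strings with an even number of b's:
    trans = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}
    start, productions = dfa_to_regular_grammar(trans, accepting={0}, start=0)
    for p in productions:
        print(p)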

Let us take a step back to the real world for a moment. After all, we are analyzing the learnability of languages by human beings, in particular by young children. What would membership queries be for a child? A membership query is the question whether some particular sentence is a member of the target language. To the extent that a child can ask questions like 'Mummy, is this a good sentence?', we can assume it has access to membership queries (also assuming, of course, that mummy gives correct answers). Moreover, young children often implicitly test sentences, by saying something to see what happens and to see how their parents react. Such tests may also be seen as a kind of membership queries. Now, it is clear that in the initial phase of language learning a child cannot ask such questions, since the ability to even pose those questions, or pronounce the test-sentences, already presupposes at least some linguistic competence. On the other hand, it seems fair to assume that in more advanced stages of language learning the child is able to ask such questions. In sum, we may assume that membership queries are not available to the child early in the learning process, but are available as soon as the child has acquired some of its native language. Thus the child's algorithm C might be an efficient PAC algorithm for the class of regular languages.

(For context-free grammars the grammatical representation is polynomially evaluable: the problem whether the language generated by some context-free grammar contains some sentence is solvable in polynomial time [HU].)

The Learnability of Type 2 Languages

Of What Type Are Natural Languages?

In the last section we saw that C might be an efficient PAC algorithm for the class of regular languages: there are no computational barriers for this, if we allow membership queries. But however this may be, it should be clear that the regular languages are far too simple: full natural languages are much more complex than that. Where in the Chomsky hierarchy should we look for natural language?

It appears that most linguists would agree that a natural language can be described by a context-sensitive (Type 1) grammar. As Allport [All] writes, one is not asserting anything particularly remarkable if one claims that all the structures of natural language can be described by a context-sensitive grammar: grammars with great formal power can describe a vast variety of structures, and so it is unsurprising if natural language structures are all members of such an unrestricted set. Whether or not natural languages can be described by context-free (Type 2) grammars seems to be a matter of dispute. On the one hand, we have Postal, who claims to prove that the Mohawk language is not context-free, and who provides fairly strong arguments for the claim that the English language is not context-free either [Pos]. Brandt Corstius disagrees with Postal's proof, but provides his own proof (in Dutch) that Dutch is not context-free [BC]. His idea applies to English as well. Consider the sentence scheme 'These physicists, philosophers, ... from, respectively, the US, Holland, ... are, respectively, supersmart, smart, ...'. The point is that an instance of this scheme is grammatical only if the sequences filled in on the three dotted parts agree in the number of terms. For instance, 'These physicists, philosophers, sociologists from, respectively, the US, Holland, Belgium are, respectively, supersmart, smart, not too dumb' is grammatical, but 'These physicists, philosophers from, respectively, the US, Holland and Belgium are, respectively, supersmart, smart, not too dumb and totally silly' is not. It can be shown that a language containing such a fragment is not context-free; abstractly, the language {a^n b^n c^n | n ≥ 1} is not context-free.

(Certain rash claims by the supervisor of the present thesis notwithstanding: in [Lok] it is claimed that human beings are deterministic finite automata, which suggests that human natural languages are regular (Type 3). This claim is based on the (obviously true) assumption that humans have a limited processing capacity, lifetime and memory, and hence cannot comprehend sentences of more than, say, one billion words. In fact, assuming such a maximal length of sentences renders natural languages finite, and hence only of Type 4. Moreover, such a length-bound would make the class of possible languages learnable as well (see the section on finite classes below). However, we should distinguish between the sentences human beings can actually comprehend or process (which is part of performance) and those that they would consider grammatical (part of competence). It might well be that the set of sentences any human being can process is finite, and hence of Type 4. Still, most native English speakers would consider the infinite class of 'respectively'-sentences considered above to be acceptable (we can query the whole set in a single question to a native speaker: 'Do you consider all such sentences acceptable?'), which would make natural language at least context-sensitive.)
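For completeness, here is a compressed version of the standard pumping-lemma argument behind that last claim (a well-known textbook proof, sketched here in LaTeX; it is not taken from the thesis itself):

    Suppose $L = \{a^n b^n c^n \mid n \ge 1\}$ were context-free, with
    pumping constant $p$. Consider $z = a^p b^p c^p \in L$. The pumping
    lemma for context-free languages yields a decomposition $z = uvwxy$
    with $|vwx| \le p$, $|vx| \ge 1$, and $u v^i w x^i y \in L$ for all
    $i \ge 0$. Since $|vwx| \le p$, the substring $vwx$ overlaps at most
    two of the three blocks $a^p$, $b^p$, $c^p$, so pumping up to $i = 2$
    increases the number of occurrences of at most two of the three
    symbols. Hence $u v^2 w x^2 y$ does not contain equal numbers of
    $a$'s, $b$'s and $c$'s, contradicting $u v^2 w x^2 y \in L$. So $L$
    is not context-free.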

On the other hand, more recently Gazdar et al. [GKPS] have conjectured that English can in fact be described by so-called generalized phrase structure grammars, which are actually equivalent to context-free grammars: a language can be generated by a generalized phrase structure grammar iff it can be generated by a context-free grammar. Thus it is not quite clear whether we need context-sensitive (Type 1) grammars for natural language, or whether context-free (Type 2) grammars suffice.

However this may be, it is clear that we have to investigate the learnability of languages more complex than Type 3. In this section we look at Type 2 languages, in the next at Type 1.

A Negative Result

Because the class of Type 2 languages is a superset of the class of Type 3 languages, the negative result for the regular languages immediately carries over to Type 2.

Corollary  The class of context-free (Type 2) languages is not polynomial-time PAC predictable in the grammatical representation.

What happens if we allow membership queries? Will this make these classes efficiently learnable? To my knowledge, no answer to this question has appeared in the literature, and neither have I been able to prove it myself. However, my conjecture would be that the class of context-free languages is not polynomial-time PAC predictable, even given an oracle for membership queries.

k-Bounded Context-Free Languages Are Learnable

Angluin [Anga] has proved a positive result for so-called k-bounded context-free languages. A context-free grammar is k-bounded if each of its productions has at most k nonterminals (and any number of terminals) in its right-hand side. A context-free language is k-bounded if it can be generated by a k-bounded context-free grammar. For example, the toy grammar from the section on formal grammars is 2-bounded.

Angluin's result assumes that there is not only a target language L, but also a particular target grammar G for that language. The result depends on the presence of an oracle for non-terminal membership queries. Such an oracle takes a string x and a nonterminal A as input, and answers 'yes' if x can be generated from the productions in G using A as starting symbol, and 'no' otherwise. Note that an ordinary membership query for the target language is a non-terminal membership query with A = S. For fixed k, the class of k-bounded context-free languages is polynomial-time identifiable from equivalence and non-terminal membership queries, and hence polynomial-time PAC learnable from non-terminal membership queries alone.

Angluin's result depends on a fixed k: in order to function properly, the algorithm that learns k-bounded context-free languages has to know in advance what k is. Actually, since any context-free grammar can be put in Chomsky normal form [HU], where each production has the form A → a or A → BC, any context-free language is 2-bounded. Thus there exists a polynomial-time PAC algorithm for the class of all context-free languages if (and this is a very unrealistic 'if') the target grammar is always in Chomsky normal form and the algorithm can make non-terminal membership queries with regard to this grammar.
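Whether a grammar is k-bounded is a purely syntactic property of its productions, so it is easy to check mechanically. A minimal sketch (illustrative Python, not part of the thesis, with symbols written entirely in upper case standing for nonterminals) that also illustrates why Chomsky normal form guarantees 2-boundedness:

    def is_k_bounded(productions, k):
        """A context-free grammar is k-bounded if every right-hand side
        contains at most k nonterminals (upper-case symbols here)."""
        return all(sum(sym.isupper() for sym in rhs) <= k
                   for _, rhs in productions)

    # The John/Paul grammar has at most two nonterminals per right-hand
    # side, so it is 2-bounded (as is any grammar in Chomsky normal form):
    john_paul = [("S", ("N", "V")),
                 ("N", ("John",)), ("N", ("Paul",)),
                 ("V", ("hates", "N")),
                 ("V", ("thinks", "that", "S"))]
    print(is_k_bounded(john_paul, 2))   # True
    print(is_k_bounded(john_paul, 1))   # False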

How realistic are non-terminal membership queries, from the point of view of a young child? In the previous section we saw that ordinary membership queries correspond to questions like 'Mummy, is this a good sentence?'. Non-terminal membership queries are similar, except that the child may now pose questions about any grammatical category. That is, it may pose questions like 'Mummy, is this a noun?', 'Is this a prepositional phrase?', 'Is this an auxiliary verb?', etc. The ability to pose sensible questions about nouns, prepositional phrases and what not presupposes quite sophisticated grammatical knowledge on the part of the child, and the ability to answer such questions presupposes even more sophisticated grammatical knowledge on the part of mummy. Therefore, it seems to be rather unrealistic to attribute the ability to make such non-terminal membership queries to young children.

Simple Deterministic Languages are Learnable

Another positive result, for a subset of the context-free languages, has been established by Ishizaka. It is known that any context-free language can be generated by a context-free grammar in Greibach normal form. Here each production has the form A → aN, where a is a terminal and N is a string of zero or more nonterminals [HU]. A simple deterministic grammar (SDG) G is a grammar in Greibach normal form such that for any terminal a and any nonterminal A, G contains at most one production of the form A → aN. A simple deterministic language (SDL) is a language generated by an SDG. The class of SDLs is a proper subset of the class of context-free languages, and properly includes the class of regular languages.

Ishizaka [Ish] provides a polynomial-time algorithm which exactly identifies any SDL from membership queries and extended equivalence queries. However, it should be noted that even though the grammar that Ishizaka's algorithm learns does indeed generate the target SDL, that grammar will not always be an SDG.

(When we are learning a class of languages L with an associated class of grammars G representing those languages, an ordinary equivalence query may only query the correctness of a grammar from G. An extended equivalence query may query the correctness of any grammar.)


The Learnability of Type 1 Languages

Here we will look into the learnability of the class of context-sensitive (Type 1) languages, where negative results abound. Firstly, as before, the negative theorem proved earlier carries over immediately to lower types:

Corollary. The class of context-sensitive (Type 1) languages is not polynomial-time PAC predictable in the grammatical representation.

Furthermore, in this case membership queries will no longer help us. The reason for this is actually quite simple. Efficient learning can only be done in a polynomially evaluable representation, and for context-sensitive languages the grammatical representation is not polynomially evaluable: in general, the problem of deciding whether a language generated by a context-sensitive grammar contains a particular sentence is NP-complete, and therefore in all likelihood not solvable in polynomial time. Thus we have the following result:

Theorem. If P ≠ NP, then the class of context-sensitive languages is not polynomial-time PAC predictable with membership queries in the grammatical representation.

This result may be strengthened somewhat, since even for certain restricted kinds of context-sensitive grammars the problem of deciding whether a grammar generates some sentence remains NP-complete. In particular, Aarts proves this for so-called acyclic context-sensitive grammars, which lie properly between context-sensitive (Type 1) and context-free (Type 2) grammars. For the details of such acyclic grammars and the proof of the NP-completeness result, see [Aar]. Aarts' result implies that the class of acyclic context-sensitive languages is not polynomial-time PAC predictable with membership queries in the grammatical representation either (assuming P ≠ NP).

Very briefly and informally: P is the class of all problems solvable in polynomial time (more precisely, solvable in time polynomial in the size of the problem instance), and NP is the class of all problems for which the correctness of a solution can be verified in polynomial time. A problem is NP-complete if it is a member of the class NP and if every other problem in NP can be translated to it in polynomial time. Accordingly, a particular NP-complete problem is polynomially solvable iff all NP-complete problems are polynomially solvable. It is known that P ⊆ NP, and if P ≠ NP then the NP-complete problems are not solvable in polynomial time. The (in)equality of P and NP has been, and still is, the main open question in complexity theory, but it is conjectured by virtually everyone that the inequality holds. It is in fact a common working assumption that the inequality holds, and hence that the NP-complete problems are not solvable in polynomial time; it would have momentous consequences, for instance on encryption methods, if this turned out otherwise. See [GJ] for an introduction to NP-completeness, and for the particular NP-completeness result about context-sensitive grammars mentioned here.

Finite Classes Are Learnable

Finally, let us end our investigations of formal language learnability with a positive result.

Theorem. Any finite class of Type 0 languages is polynomial-time PAC learnable in the grammatical representation.

Proof. Let F = {L_1, ..., L_k} be a finite class of Type 0 languages, and let G_1, ..., G_k be Type 0 grammars for those languages, respectively. This class is easily seen to be identifiable from at most k equivalence queries: an algorithm that makes an equivalence query for each G_i, ignoring the returned counterexamples, will do the job. As soon as the oracle answers 'yes' on some G_i, we have identified L_i as the target language. Since k is fixed, it follows that F is polynomial-time identifiable from equivalence queries. This implies that F is polynomial-time PAC learnable as well. □
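To see how little machinery this proof needs, here is a minimal sketch of the query strategy in Python. Everything in it (the representation of languages as finite sets, the shape of the equivalence oracle, the helper names) is an illustrative assumption of mine, not part of the theorem; the point is only that the learner cycles through its k built-in candidates and ignores every counterexample.

    def identify_from_equivalence_queries(candidates, equivalence_query):
        """candidates: the k hypothesis languages known in advance (the learner's bias).
        equivalence_query(h) returns (True, None) if h equals the target language,
        and (False, counterexample) otherwise.  At most k queries are made, and the
        returned counterexamples are simply ignored."""
        for index, hypothesis in enumerate(candidates):
            correct, _counterexample = equivalence_query(hypothesis)
            if correct:
                return index, hypothesis
        raise ValueError("target language not in the finite candidate class")

    # Tiny usage example with languages represented as finite sets of sentences.
    L1, L2, L3 = {"aa"}, {"ab", "ba"}, {"abc"}
    target = L2
    def oracle(h):
        if h == target:
            return True, None
        return False, next(iter(h ^ target))   # some sentence on which they differ
    print(identify_from_equivalence_queries([L1, L2, L3], oracle))   # index 1, i.e. L2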

In some respects this is a very interesting result, because if we ignore variation in the lexicon, then Chomsky's current principles-and-parameters theory only allows a finite number of distinct natural languages: there are only finitely many parameters, each of which can take on only a finite number of distinct values, so the number of allowed sets of principles is finite (see [Cho]). If we could somehow limit the set of allowed lexicons, it would follow that the class of natural languages is polynomial-time PAC learnable.

Note, however, that the polynomial-time PAC learning algorithm for F = {L_1, ..., L_k} must know in advance grammars G_1, ..., G_k that generate the languages in F. Thus if the child's algorithm C is such an algorithm for a finite class of languages, the main claim of this chapter is vindicated: human language learning needs bias, in this case pre-knowledge of the k possible languages.

Note also that putting an upper bound on the length of sentences makes the set of possible sentences finite, which in turn makes the set of all possible languages finite, and hence polynomial-time PAC learnable. There may actually be some biological truth in an upper bound on the length of sentences that can be processed, since any sentence-processing human being will have a limited memory and lifetime (see also the earlier footnote on this point). However, the positive learnability result in case of such a length-limitation is an artefact of our definitions, rather than a positive result about learnability in practice. The result follows from the fact that a constant-bounded number of computational steps is polynomially bounded, and therefore considered efficient by our definitions, no matter how large the constant bound actually is.

For instance, if we use the simple binary alphabet {0, 1} and limit the length of a sentence to some fixed number of characters, there are only finitely many possible sentences, and hence only finitely many possible languages. This class is learnable according to our definitions, because an algorithm needs at most a constant (and hence obviously polynomially bounded) number of equivalence queries to identify a language from this set. Unfortunately, this constant may be considered infinite for all practical purposes, since already for moderate sentence lengths it is larger than the number of particles in the universe.
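The arithmetic behind this remark is easy to reproduce. The sketch below (in Python, with length bounds of my own choosing rather than the one used in the text) counts the nonempty binary sentences of length at most L and the corresponding number of possible languages; already for quite modest L the latter dwarfs the roughly 10^80 particles usually estimated for the observable universe.

    def sentence_count(L):
        """Number of nonempty binary sentences of length at most L."""
        return 2 ** (L + 1) - 2          # 2 + 4 + ... + 2**L

    for L in (3, 10, 40):
        s = sentence_count(L)
        print(f"length bound {L}: {s} sentences, hence 2**{s} possible languages")
    # Length bound 3 gives 14 sentences and 2**14 = 16384 languages; at length
    # bound 40 there are already about 2.2 * 10**12 sentences, so the number of
    # possible languages, 2**(2.2 * 10**12), vastly exceeds the number of particles
    # in the observable universe, yet one equivalence query per language still
    # counts as a 'constant' number of queries by our definitions.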


What Does All This Mean?

Let us now take stock. We have seen a whole bunch of formal theorems about the learnability and non-learnability of various classes of formal languages. What does all this mean for language acquisition by children? Well, it follows from the previous results that whatever the child's learning algorithm C is, it cannot be an efficient learning algorithm for all context-sensitive languages, not even for all acyclic ones, let alone for all possible Type 0 languages. Accordingly, the class L of languages mentioned in the conjecture stated earlier in this chapter must necessarily be a very restricted class of languages. Furthermore, learning algorithms for such restricted classes will only work if they know in advance what they are looking for, that is, if they know the class of languages that the target language comes from. In other words, our algorithm C needs to know in advance which class of languages are possible target languages; general-purpose learning mechanisms which do not have such pre-knowledge will not work. This means that children must have a certain linguistic bias which enables them to learn certain languages efficiently.

This bias must largely be present in the child at the age when language learning commences. Where does it come from? There seem to be only two possible sources for this bias: it can be hardwired in the brain of the newborn child, and hence innate, or it can be learned in, say, the first year of the child's life, before the process of language learning starts. Probably both sources contribute something to linguistic bias. However, since language does not play a large role in the environment of a very young child, it seems unlikely that the second, non-innate source of bias contributes very much. (Actually, it has been suggested that language learning already commences in utero, which would rule out this second possibility.) After all, rattles and mother's milk have very little to do with passivizing transformations and sundry lexical features. From this I conclude that the linguistic bias that children must have is probably for a very large part innate, as Chomsky has argued all along.

Let us call the languages in L the Natural Languages (this includes all existing natural languages), and let us call the set G of grammars that generate the languages in L the Natural Grammars. Since L cannot contain all possible Type 0 languages, G cannot contain all possible grammars. What exactly are the characteristics of the grammars that G does contain? As the quotation from Chomsky given earlier indicated, this question is an empirical one, which cannot be fully answered by the mathematical approach of the present thesis. By formal means we can find restrictions on G, such as that it cannot contain all context-sensitive grammars, but these formal tools will not tell us which grammars G actually does contain.

Computational learning theory may provide suggestions as to the class L, but not proofs about its contents. For example, Angluin's positive result for k-bounded context-free grammars might suggest the possibility that G is the set of all k-bounded context-free grammars, for some fixed k (assuming the child can somehow make nonterminal membership queries). If natural languages are context-free, as some linguists have argued, then this is a significant result. Unfortunately, the formal approach used here will not tell us which fixed k is involved here. This k can only be determined from empirical work: write context-free grammars for all existing natural languages, and see by which k they are all bounded. (Since any context-free language can be generated by a 2-bounded grammar in Chomsky normal form, we might argue that k = 2. However, this would make the assumption that children can make nonterminal membership queries even more unrealistic, since grammars in Chomsky normal form look very unnatural. It is not very plausible to assume that parents can answer nonterminal membership queries for a 'natural' context-free grammar of their language, involving familiar categories like nouns and verbs, let alone if this grammar is transformed into an artificial Chomsky normal form.) Similarly, the result of the previous section might suggest that G is some finite set, but no merely mathematical work will tell us which finite set of grammars G actually is.

The restriction of our language acquisition procedures (i.e., algorithm C) to languages with some Natural Grammar enables human beings to acquire their native language quite efficiently. However, there is one disadvantage, with possibly far-reaching consequences. Namely, learning a language that does not conform to Natural Grammar might be exceedingly hard for us humans. This suggests that languages of beings whose 'hardwiring' or cognitive structure is quite different from our own (e.g., Martians, and possibly some animals) would not be learnable for us, and hence we would not be able to understand and communicate with those beings. (This also sheds new light on Wittgenstein's famous dictum 'Wenn ein Löwe sprechen könnte, wir könnten ihn nicht verstehen', 'if a lion could speak, we could not understand him' [Wit]: if the lion's language does not conform to our human Natural Grammar, we are probably not able to learn it.)

Wider Learning Issues

The previous sections applied computational learning theory to the acquisition of languages by children. To sum up the argument:

1. The class of natural languages must be efficiently learnable, because most children successfully acquire one or more languages.
2. Computational learning theory shows that only very restricted classes of grammars are efficiently learnable. In particular, the classes of all Type 0 or even all context-sensitive (Type 1) languages are not efficiently learnable.
3. Hence the class of natural languages cannot be the class of all languages; it must be some very restricted class.
4. Children must have some pre-knowledge or bias concerning these restrictions upon the class of natural languages in order to be able to learn. This bias is probably largely innate.

However, language learning is just one example where learning from examples takes place. Young children are exceedingly good at learning many other things besides language as well, such as learning to recognize faces, learning how objects usually fall, how people walk, which species of animals are dangerous, and so on. Thus far, not much research has been devoted to the PAC learnability of faces or pictures, or of the behaviour of everyday objects. However, a number of general negative results on the learnability of formulas from propositional logic (boolean functions) have appeared (see for instance [KV]). Since propositional logic is a relatively simple system, we may also expect many negative learnability results for the kinds of learning that children engage in every day. On the other hand, children achieve these learning tasks quite efficiently and effortlessly. Thus we are again led to an explanation in terms of innate structures: apparently children are born with a certain bias or pre-knowledge ('knowledge' here taken in a very broad sense) that helps them to learn to deal with the kinds of objects, animals and humans that they are likely to encounter in the early phases of their lives.

There are in fact many empirical results that point in this direction; see for example Steven Pinker's discussion of the innateness of intuitive mechanics and intuitive biology [Pin]. In Pinker's words: 'We all get away with induction because we are not open-minded logicians but happily blinkered humans, innately constrained to make only certain kinds of guesses, the probably correct kinds, about how the world and its occupants work' [Pin].

Summary

The main conclusion of this chapter: without a strong linguistic bias, children would not be able to learn a language from examples efficiently. Since it is evident that they do learn their native language quite fast and approximately correctly, it follows that children must have such a bias. Because it is not very plausible to assume that a child acquires this bias in the short period before the process of language learning starts, it is probably for a very large part innate. This vindicates one of the main tenets of Chomskyan linguistics. Similar arguments apply to many other kinds of learning that people in general, and young children in particular, engage in.

Chapter 3

Kolmogorov Complexity and Simplicity

Introduction

There is an interesting paradox about words that do or do not apply to themselves, Grelling's paradox. Let us call a word that applies to itself, or that is true of itself, autological. Examples are 'English', which is itself an English word, and 'old', a word which has been in use for many centuries now. Call a word that does not apply to itself heterological, such as the clearly non-red word 'red' or the non-German word 'German'. Now the question is: is the word 'heterological' itself heterological? If it is, then it isn't; if it isn't, then it is. A puzzling paradox indeed.

A clear example of a heterological word is 'simplicity', which is exceedingly complex and hard to explain [Bun]. Of course we can teach this notion to someone, that is, to a human being with biases similar to ours, simply by giving some examples, which will usually suffice in practice; but it is very hard to state explicitly and generally what simplicity amounts to. Since the notion of simplicity is a rather important one in philosophy, we are obliged to devote a lot of effort at making it more perspicuous: 'What the problem of simplicity needs is a lot of hard work' [Goo]. Fortunately, most of the really hard work has already been done for us in mathematics and computer science, though most philosophers appear to be unacquainted with this work. The fundamental measure of complexity (or simplicity) that has been developed is called Kolmogorov complexity, and is the topic of the present chapter.

The basic question to start with is:

    Can we objectively measure the complexity of an object?

A first stab at an answer might be that the complexity of an object is proportional to the number of its parts. This would require us to be able to identify and count the parts of an object. However, what we recognize as a part of an object is relative to our interests. Thus a car driver would describe a car as consisting of four tires, a steering wheel, windows, etc., while a physicist would consider it to be built up from particles like protons, neutrons and electrons. Moreover, it would be rather pointless to call the physicist's description the more 'fundamental' or 'better' description in general, since this physical description will be quite useless to the ordinary car driver. Therefore the level of descriptions is a more suitable level of analysis of complexity than the level of objects. (Compare [Wit]; similarly, Bunge [Bun] directs his attention at what he calls 'semiotic' simplicity, not at 'ontological' simplicity.) Accordingly, we will redirect our basic question towards the complexity of descriptions, or, more generally, of strings of characters:

    Can we objectively measure the complexity of a string?

At first sight, the complexity of a string appears to be simply its length (i.e., the number of character-tokens it contains), which is not very interesting from a philosophical point of view. However, from simple examples it can already be seen that strings of equal length may diverge widely in complexity. Consider x_1, a string consisting of some large number n of 1s, and x_2, which gives the results of n random coin flips, where the ith character of x_2 is 1 if the ith coin flip comes up heads, and 0 in case of tails. The string x_1, despite being n characters in length, is actually fairly simple: the short description 'n 1s' fully describes x_1 using only a handful of characters. On the other hand, as each coin flip is independent of the others, the shortest description of x_2 will probably be x_2 itself. Thus the complexity of x_1 is much lower than the complexity of x_2.

The thing is, of course, that a string can be represented in many ways, and very regular or 'simple' strings can be represented very economically. Thus, as we saw, the short description 'n 1s' can represent the long string x_1. In this vein, we could identify the complexity of a string with the length of its shortest representation. However, the notion of a 'representation' still requires clarification: what does it mean for one string to represent another? Clearly, infinitely many representation-schemes are possible. We could simply write down a representation explicitly, as a two-column table where the strings in the first column line-by-line represent the strings in the second column. This table, however, will grow to infinite length if we want to be able to represent an infinite number of strings.

Ideally, a string would itself give something like a recipe to generate the string it represents; this would allow us to dispense with the two-column table. For example, the description 'n 1s' tells us that putting n 1s in a sequence gives us the string x_1 that it represents. Now, the most basic idea of a 'recipe' that we have is the notion of an algorithm; every algorithm is a Turing machine, and every Turing machine can be encoded as a binary string. This gives us the following explication of what it means for one string to represent another:

    String y represents string x if y is the encoding of a Turing machine that generates x and then halts.

Now we can identify the Kolmogorov complexity of a string x with the length of its shortest representation:

    The Kolmogorov complexity of a string x is the length of a shortest y such that y represents x.

We use the phrase 'a shortest y' rather than 'the shortest y', since in general there may be several distinct Turing machines of the same length that each produce x.

The main aim of this chapter is to extract from the (often highly technical) literature on Kolmogorov complexity those aspects which are of interest to philosophy. Except for the selection of topics and their presentation, no originality is claimed here. The chapter is organized as follows. We start with a precise definition of Kolmogorov complexity in the next section, and state some of its main properties, notably its objectivity up to a constant, its non-computability, and its relation to information theory. After that, we will discuss various philosophical issues where Kolmogorov complexity is relevant. The main application, as the title of this chapter already indicated, lies in a formalization of the notion of simplicity, which is omnipresent in philosophy in general and in the philosophy of science in particular. Secondly, Kolmogorov complexity also allows us to clarify the notion of randomness, which will be taken up in a later section. Thirdly and finally, the definition of Kolmogorov complexity allows us to give a proof of Gödel's fundamental incompleteness theorem which does not make use of self-referring sentences.

As the reader will notice, the present chapter contains virtually nothing on learning theory, despite the title of this thesis. However, one of the most important applications of Kolmogorov complexity lies in inductive learning. We will defer this till the next chapter, where Kolmogorov complexity will be used to formalize Occam's Razor, which says that simple hypotheses are to be preferred over more complex ones.

Definition and Properties

In this section we will define Kolmogorov complexity and state some of its main properties. The idea to define the complexity of a string as the length of a shortest Turing machine that produces the string was developed independently, and for different purposes, in the 1960s by three different persons:

- Ray Solomonoff [Sol] used it in order to define a universal probability distribution, which can be used for prediction.
- Andrei Kolmogorov [Kol] primarily introduced the complexity measure named after him in order to study randomness.
- Gregory Chaitin [Cha] defined Kolmogorov complexity for studying complexity as well as randomness.

Though 'Solomonoff-Kolmogorov-Chaitin complexity' might be somewhat more appropriate, the name 'Kolmogorov complexity' appears to have stuck.


Turing Machines and Computability

In the introduction we came up with the following definition:

    The Kolmogorov complexity of a string x is the length of a shortest y such that y represents x,

where y represents x if it encodes a Turing machine that generates x. In the next subsection we will make this definition precise. Here we will first explain in some more detail what a Turing machine is, what it does, and what it can do. Turing machines were introduced by Alan Turing [Tur]. Informally, a Turing machine consists of a table of instructions, which describe how the machine manipulates the symbols on one or more infinite tapes of cells. Each instruction tells what the machine should do, given the particular state it is currently in and the contents of the tape cell it is currently scanning. Given a state and tape contents, the instruction tells the machine which symbol to write in its current cell, in which direction to move its tape head, and in which state to go next. All the machine does is follow these instructions step by step, until (if ever) it reaches a point where none of its instructions is applicable, and then it halts.

We will restrict the alphabet of symbols allowed on the tapes to 0, 1 and 'blank'. Since anything that can be stated in some language can be encoded in binary, this is not a real restriction. In particular, we can set up a correspondence between the natural numbers and binary strings, for instance the following (ε denotes the empty string):

    0 ↔ ε,  1 ↔ 0,  2 ↔ 1,  3 ↔ 00,  4 ↔ 01,  5 ↔ 10,  6 ↔ 11,  7 ↔ 000,  ...

Note that the binary representation of a number n has approximately length log n bits (here we use logarithms with base 2). We will be a bit informal about the distinction between numbers and the corresponding binary strings, switching back and forth whenever this is convenient; when we speak of some number x, it will be clear from the context whether we mean that number itself or the corresponding binary string.

A Turing machine T computes a function f from the natural numbers to the natural numbers as follows. Suppose we start executing T in some initial state, with an initial tape that contains only one binary string, corresponding to the natural number n. If T's execution terminates with some natural number m (in binary) on its tape, we define f(n) = m; otherwise f(n) is undefined. A function is called total if it is defined on each element of its domain, so f is a total function if T halts on all natural numbers.

Definition. A function f from N to N that is computed by some Turing machine T is called partially recursive or computable. If f is total and T halts on all inputs, then f is called total recursive or recursive. ◊

By the Church-Turing thesis, any intuitively ('mechanically') computable function is partially recursive.


Definition. A set A of natural numbers is recursive (or decidable) if there is a recursive function f such that f(n) = 1 if n ∈ A, and f(n) = 0 if n ∉ A. A is recursively enumerable if there is a partial recursive function f such that f(n) = 1 if n ∈ A, and f(n) = 0 or f(n) is undefined if n ∉ A. ◊

The intuition behind the latter notion is that a set A is recursively enumerable if there is a Turing machine which outputs a (possibly infinite) sequence containing all and only members of A. We can also define recursiveness and recursive enumerability for sets of other objects, as long as we can encode these as natural numbers. For example, the set of Turing machines that halt is not recursive: there is no algorithm that takes a binary encoding of an arbitrary Turing machine as input and determines whether this machine halts after a finite number of steps. This is the well-known undecidability of the halting problem, due to Turing.
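The standard argument behind this undecidability result is short enough to sketch in code. The following Python fragment is the usual diagonal construction, in my own phrasing rather than a quotation from Turing: if a total recursive halts decider existed, the program paradox would halt on its own source text exactly when it does not, which is impossible.

    def halts(program_source: str, argument: str) -> bool:
        """Hypothetical decider for the halting problem; assumed here, not implementable."""
        raise NotImplementedError("no such total recursive function can exist")

    def paradox(program_source: str) -> None:
        # Do the opposite of what the decider predicts about running the program on itself.
        if halts(program_source, program_source):
            while True:        # predicted to halt, so loop forever
                pass
        return                 # predicted to loop, so halt immediately

    # Running paradox on its own source text would halt iff it does not halt: a
    # contradiction, so the assumed function halts() cannot exist.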

We can also compute functions that have two or more natural numbers as input and/or two or more numbers as output, by letting the initial or final tape contain two or more natural numbers, separated in some suitable way. An n-tuple of numbers will be denoted by ⟨x_1, ..., x_n⟩. Using functions with two natural numbers as output, we can also define functions that range over the set of rational numbers Q: a partial recursive function f from N to pairs of natural numbers can also be seen as a partial recursive function g from N to Q, where g(n) = p/q for f(n) = ⟨p, q⟩. Using this, we can also compute, or at least approximate, functions ranging over the set of real numbers R.

Definition. A function f from N to R is called enumerable if there is a recursive function g from pairs of natural numbers to Q such that g(x, k+1) ≥ g(x, k) for all x and k, and lim_{k→∞} g(x, k) = f(x) (g approximates f from below). Analogously, the function f is co-enumerable if it can be approximated from above. Finally, f is recursive if there is a recursive g such that |f(x) − g(x, k)| < 1/k for every x and k. ◊

Let T_1, T_2, ... be a list of all Turing machines. Each of these computes a partial recursive function, so there is a corresponding list f_1, f_2, ... of all and only partial recursive functions. A very fundamental concept is the universal Turing machine. This is a Turing machine U that can simulate all other Turing machines: for every i there exists a number (or binary string) t_i such that, given inputs t_i and a number n, U computes the same function as T_i on n, i.e., U(t_i, n) = f_i(n) for all n. Such a t_i can be called an encoded Turing machine, or a program for U. It is a fundamental result that such universal Turing machines can actually be constructed; there are in fact infinitely many of them. If T_i is a Turing machine and t_i is its shortest binary encoding relative to U, then the length of T_i relative to U will be l(T_i) = l(t_i), the string length of t_i.

Definition

Let us fix some particular universal Turing machine T_u, and let f_u be the function from N to N that T_u computes. If f_u(t_i, y) = x, then T_u produces x when its initial input tape contains program t_i and input y; if y is empty, we simply write f_u(t_i) = x. Then we can define the Kolmogorov complexity of a string x as the length of a shortest Turing machine that computes x: K(x) = min{ l(t) | f_u(t) = x }. However, for technical reasons we have to make one important addition, namely that we encode the Turing machines in a prefix-free way. That is, if t_1 and t_2 are encodings of Turing machines (programs for T_u), then t_1 is a prefix of t_2 only if t_1 = t_2 (x is a prefix of y if y = xz for some z). For instance, it cannot be the case that some program and a proper extension of it are both encodings of Turing machines, since the former is a prefix of the latter. With this condition in place, we can define:

Definition. Let T_u be some fixed universal Turing machine whose set of programs is prefix-free, and let f_u be the function from N to N that T_u computes. The Kolmogorov complexity of a binary string x is

    K(x) = min{ l(t) | f_u(t) = x }.  ◊

Thus, as promised, the Kolmogorov complexity of a string x is indeed the length of a shortest Turing machine that computes x, starting from an initially empty tape. The Kolmogorov complexity of a natural number n is the Kolmogorov complexity of the binary representation of n.

It is fairly easy to show that there exists a constant c such that for every x, K(x) ≤ l(x) + c. Informally, a simple computer program like 'print x' suffices to generate x. The length of this program will be the length of 'print', which is a constant independent of x, plus the length of x. Thus the Kolmogorov complexity of a string cannot be much larger than its own length. This is how it should be, since a string is a complete description of itself.
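As a toy illustration of this upper bound, consider the following sketch. The one-character opcode and the helper names are my own inventions, not part of any standard definition; the only point is that a literal 'print the rest' program is longer than x by a fixed constant, so K(x) ≤ l(x) + c.

    PRINT = "p"   # a single-symbol opcode meaning: output the rest of the program

    def literal_program(x: str) -> str:
        """A program that outputs x by simply carrying x along verbatim."""
        return PRINT + x

    def run(program: str) -> str:
        """A toy interpreter that only understands the PRINT opcode."""
        assert program.startswith(PRINT)
        return program[len(PRINT):]

    x = "0110100110010110"
    p = literal_program(x)
    assert run(p) == x
    print(len(p) - len(x))   # the constant overhead c (here 1), independent of x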

Objectivity up to a Constant

In the introduction to this chapter we claimed Kolmogorov complexity to be objective. But doesn't it depend on the particular universal Turing machine T_u we use? Would not a different choice of T_u lead to different complexities? Indeed it would. But still, Kolmogorov complexity can be called objective, due to the following Invariance Theorem.

Theorem (Invariance). Let T_u and T_v be universal Turing machines, let K_u(x) denote the Kolmogorov complexity of x relative to T_u, and K_v(x) the Kolmogorov complexity of x relative to T_v. Then there exists a constant c (depending on u and v, but not on x) such that for every x:

    K_u(x) ≤ K_v(x) + c.

(There also exists a version of Kolmogorov complexity without the prefix-freeness requirement. This version, denoted C(x) and discussed in [LV], has some undesirable properties which make it less interesting than the prefix-free version.)


This result is a fairly obvious consequence of the fact that any two universal Turing machines can simulate each other. Let T_u and T_v be two universal Turing machines. Since T_u is a universal Turing machine, there exists a T_u-program s which computes the same function as T_v; this s can be called a simulation of T_v on T_u. Suppose the machine with encoding t is a shortest Turing machine that computes x relative to T_v, so K_v(x) = l(t). If we want to compute x relative to T_u, then we can take the simulation program s, feed t into it, and the program s will execute t for us, and hence generate x on T_u. Thus K_u(x) is at most K_v(x) = l(t) plus a constant c, which accounts for the length of the simulation program s and some overhead.

The theorem implies that there is a constant c such that for every x:

    |K_u(x) − K_v(x)| ≤ c.

Thus, though Kolmogorov complexity depends on the particular universal Turing machine T_u we choose, for larger x the relative influence of this choice becomes negligible: if we are dealing with objects whose Kolmogorov complexity is very much larger than the constant c, then the relative difference |K_u(x) − K_v(x)| / K_u(x) is tiny. This makes Kolmogorov complexity sufficiently objective.

Non-Computability

In general, we can discern two distinct desirable goals in formal analysis. Firstly, it should provide us with a clearer insight into the analyzed topic; we will see in the next section how Kolmogorov complexity allows us to supply clear and precise meanings to notions like simplicity and randomness. The second goal of formal analysis is to provide us with useful, practically applicable tools. In order for Kolmogorov complexity to be fully applicable, we should be able to find out what the Kolmogorov complexity of a given string is. Unfortunately, this is beyond us, at least beyond algorithmic means: Kolmogorov complexity is not computable. For a proof we refer to [LV].

Theorem (Non-Computability). The function K(x) is not recursive.

However, we are able to approximate K(x). There exists a particular recursive function g(x, k) such that if we successively compute g(x, 1), g(x, 2), g(x, 3), ..., the sequence of numbers we obtain will converge to K(x) from above. For instance, given x and k, g might simulate the first k steps of the first k Turing machines, and output the length of a shortest of these k machines that produces x (if none of the first k machines produces x, let g be some huge, 'practically infinite' number). It is clear that g(x, k) decreases when k grows. Furthermore, if the ith Turing machine is a shortest Turing machine that produces x, say after j steps, then g(x, max{i, j}) = K(x), so lim_{k→∞} g(x, k) = K(x).

Theorem (Approximation). The function K(x) is co-enumerable.

This result makes approximate application of Kolmogorov complexity at least possible in principle; of course, computability of an approximation does not imply efficient or practical computability of an approximation.
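The following sketch shows the shape of such an approximation from above. Since real programs need not halt, the genuine construction must dovetail step-bounded simulations of the first k machines; in this toy version (entirely my own, with a made-up 'universal' interpreter whose programs always halt) we can simply search all programs up to a given length and report the shortest one found. Increasing the search bound can only lower the reported value, which is exactly the sense in which K(x) is approached from above.

    from itertools import product

    def toy_universal(program: str) -> str:
        """A toy stand-in for a universal machine.  Programs are binary strings:
        '1' + w                   outputs w literally;
        '0' + (k ones) + '0' + w  outputs w repeated 2**k times."""
        if program.startswith("1"):
            return program[1:]
        body = program[1:]
        if "0" not in body:
            return ""
        k = body.index("0")
        w = body[k + 1:]
        return w * (2 ** k)

    def K_upper_bound(x: str, max_len: int) -> int:
        """Length of the shortest toy program of length <= max_len that outputs x;
        a huge sentinel value is returned if none has been found yet."""
        for n in range(1, max_len + 1):
            for bits in product("01", repeat=n):
                if toy_universal("".join(bits)) == x:
                    return n
        return 10 ** 9   # 'practically infinite'

    print(K_upper_bound("1" * 64, 12))   # finds the 9-bit program '0' + '111111' + '0' + '1'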


Relation to Shannon's Information Theory

The complexity K(x) of x may be seen as a measure of the amount of information inherent in x, since we need a program of at least K(x) bits to reconstruct x. In this subsection we will see how Kolmogorov complexity relates to the older and more famous definition of information given by Claude Shannon [Sha, CT]. We will not use this in the remainder of the thesis, so the reader may wish to skip this section on a first reading.

Briefly, in Shannon's framework a message is a finite sequence of words. We will assume each word is a binary string. Each of the possible words w_1, ..., w_k has a definite probability (frequency) of appearing; let P be the probability distribution over these words. The entropy of P is defined by

    H(P) = − Σ_{i=1}^{k} P(w_i) log P(w_i).

For example, consider a simple language of three words, each with its own P-probability. A message in this language is simply a finite sequence of these words, drawn according to P, and the entropy H(P) is obtained by summing −P(w_i) log P(w_i) over these three words. H(P) is the information content, in bits, of a message of one word; a message of n words then contains n·H(P) bits of information.

We can encode messages in this language by assigning each word w_i a particular codeword c_i (another binary string), and by encoding each sequence of words as the corresponding sequence of codewords; this can be done in such a way that a bitstring which is a sequence of codewords can always uniquely be decomposed into those codewords again. If we do this in a smart way, assigning short codewords to high-probability words, we can achieve a much more efficient and economical representation than if we represent words simply by themselves. By a fundamental theorem of Shannon's, H(P) is an almost reachable lower bound for the length of binary encodings of messages: we can assign each word a codeword in such a way that if we draw a word according to P, then the expected length of its codeword equals H(P) to within one bit (see [CT] or [LV]). Thus, for an optimal encoding of the language above, the expected codelength of an arbitrary message of n words is approximately n·H(P) bits, whereas if we use each of the three words simply as its own codeword, the expected codelength is larger.
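As a concrete, and entirely made-up, instance of these formulas, suppose the three words have probabilities 1/2, 1/4 and 1/4 (the text's own numbers are not reproduced here). The sketch below computes H(P) and checks it against a simple prefix-free code, which in this case meets Shannon's bound exactly.

    import math

    P = {"w1": 0.5, "w2": 0.25, "w3": 0.25}           # assumed probabilities

    def entropy(dist):
        """H(P) = -sum_i P(w_i) * log2 P(w_i), in bits per word."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    code = {"w1": "0", "w2": "10", "w3": "11"}        # a prefix-free code for these words
    expected_len = sum(P[w] * len(code[w]) for w in P)

    print(entropy(P))        # 1.5 bits per word
    print(expected_len)      # 1.5 bits: matches H(P), Shannon's (almost reachable) lower bound
    # A message of n words then needs about 1.5 * n bits under this code, against
    # 2 * n bits if each of the three words were given its own fixed two-bit codeword.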

Now, sequences of words are simply binary strings, and hence have a Kolmogorov complexity. If we draw a word according to P, then the expected Kolmogorov complexity of this word is Σ_{i=1}^{k} P(w_i) K(w_i). Surprisingly, this expected Kolmogorov complexity is asymptotically equal to the entropy [LV], provided P is recursive.


Theorem. If P is a recursive probability distribution over words w_1, ..., w_k, then

    lim_{H(P)→∞}  ( Σ_{i=1}^{k} P(w_i) K(w_i) ) / H(P)  =  1.

What does this mean? It means that if P is sufficiently complex (i.e., H(P) is sufficiently large) and we draw a word w according to P, then on average K(w) will be approximately equal to H(P). Consequently, for a long message v_1 ... v_n of n words, we can expect K(v_1 ... v_n) to approximately equal the information content n·H(P) of the message, and we can use the shortest programs that generate strings as approximately optimal codewords for those strings. Nevertheless, despite the fact that the Kolmogorov complexity of a message converges to its Shannon information content, we agree with Cover and Thomas [CT] that Kolmogorov complexity is more fundamental than Shannon entropy, because it does not depend on a particular probability distribution P. (Some relations between Shannon information, Kolmogorov complexity, and the notions of information and entropy in physics are described in [LV].)

Simplicity

From a philosophical point of view, the most important contribution of the theory of Kolmogorov complexity has been to provide us with a precise definition of the simplicity of individual strings. Many unsuccessful attempts have been made to measure the complexity of theories, especially when formulated in first-order logic. (See [Hes] for an overview of some of these approaches at measuring simplicity, and their problems. More recently, Elliott Sober [Sob] has attempted to equate simplicity with informativeness relative to given questions: a theory is more informative to the extent that it needs less additional information in order to be able to answer a given question.) One of the most striking of these attempts was Popper's proposal to identify degree of simplicity with degree of falsifiability or strength [Pop]. However, the following example by Goodman [Goo] shows that simplicity can neither be identified with strength, as Popper wants, nor with safety:

1. All maples, except perhaps those in Eagleville, are deciduous.
2. All maples are deciduous.
3. All maples whatsoever, and all sassafras trees in Eagleville, are deciduous.

Clearly the second of these hypotheses is the simplest, and is preferable to the others if consistent with the data. However, (3) is stronger than (2), while (1) is safer (i.e., more likely to be true) than (2). Thus neither the strongest nor the safest hypothesis need be the simplest.

Kolmogorov complexity does give us a sound and objective quantitative measure for complexity/simplicity. That is:

    A string x (a description, a theory, etc.) is simple to the extent that it has low Kolmogorov complexity.


Not surprisingly, simplicity shows up as a gradual notion here: things are not simple per se, but they can be more simple or less simple. Because, as we have seen, the Kolmogorov complexity of x is objective to within a constant independent of x, this definition of simplicity is objective as well. The only subjectivity lies in the choice of the particular universal Turing machine we use, but the influence of this choice becomes negligible for larger x. Actually, many philosophers have claimed that simplicity is too subjective and context-dependent to be objectively definable at all. For instance, Lakatos writes:

    No doubt simplicity can always be defined for any pair of theories T_1 and T_2 in such a way that the simplicity of T_1 is greater than that of T_2. [Lak]

To be sure, this also holds for Kolmogorov complexity. If we want to give a particular string x a very low Kolmogorov complexity, we can achieve this by tinkering with some ordinary universal Turing machine T_u in such a way that it outputs x when it is given some very short string as input; the new machine would thus have x somehow hardwired into its program. Then, relative to this new machine, x has a very low Kolmogorov complexity, so we can tinker in such a way that any particular string gets a very low complexity. However, if we choose some 'reasonable' universal Turing machine, in which no information about particular strings is hardwired, this problem will not arise, and we can stick to the objectivity of Kolmogorov complexity.

Why should we bother about simplicity? Because it is one of the guiding principles of science: simplicity of a theory is generally regarded as a virtue. Scientists generally follow the maxim that simple or elegant theories are to be favoured, both in their practice and in their own theorizing about science. As Quine writes:

    Consciously the quest, the sifting of evidence, seems to be for the simplest story. ... Simplicity is not a desideratum on a par with conformity to observation. Observation serves to test hypotheses after adoption; simplicity prompts their adoption for testing. Still, decisive observation is commonly long delayed or impossible, and insofar at least simplicity is final arbiter. ... Whatever simplicity is, it is no casual hobby. As a guide of inference it is implicit in unconscious steps as well as half explicit in deliberate ones. The neurological mechanism of the drive for simplicity is undoubtedly fundamental though unknown, and its survival value is overwhelming. [Qui]

Or, in Goodman's words:

    ... simplification is the heart of science. Science consists not of collecting particular truths but of relating, defining, demonstrating, organizing, in short, of systematizing. And to systematize is to simplify. Science is the search for the simplest applicable theory. [Goo]


We will see more examples in the next chapter. Given the importance of simplicity, clarifying what makes a simple theory simple, and what makes a simple theory favourable, is an important topic in the philosophy of science.

The idea that selecting simple theories is good, however, already brings us to Occam's Razor, to which we will devote the whole next chapter. In this section we will ignore the prescriptive aspects of simplicity (that simplicity is good), restricting attention to its conceptual aspects. Most important among these: what are its relations to the notions of elegance and beauty?

In general, if we do not restrict attention to science, what we would call 'simple', 'elegant' or 'beautiful' can diverge widely. For instance, the word 'elegant' often has the connotation of 'slightly superficial', and hence diverges from 'beautiful'. Furthermore, simple works of art need not be beautiful (Who's Afraid of Red, Yellow and Blue?); conversely, many of the most beautiful pieces of art are highly complex and dense with connotations. On the other hand, simple works of art can be very beautiful; examples that come to mind are Mondriaan's abstract paintings and Satie's early piano music. Moreover, both elegance and beauty are irreducibly subjective, whereas simplicity could to a large extent be made objective, as we have seen.

However, when we do restrict attention to the roles of simplicity, elegance and beauty in science, particularly their roles as properties of scientific theories, things start to get interesting. Many a scientist or philosopher has used these notions in the same breath, and it is not at all clear how they can be distinguished. Ordinary usage of these terms does not seem to provide us with clear boundaries. On the one hand this is a nuisance, but on the other it also leaves us plenty of room to specify these boundaries for ourselves, supplying our own definitions. I would like to propose the following informal definitions:

- A theory is simple to the extent that it can be described easily. This can satisfactorily be formalized in terms of low Kolmogorov complexity.
- A theory is elegant to the extent that it is simple in the above sense and easy to handle.
- A theory is beautiful to the extent that it causes feelings of pleasure, delight, reverence and wonder.

Because of the vagueness of natural language, any boundary will be somewhat arbitrary. Whether these particular boundaries diverge too far from ordinary usage, I leave for the native speakers of English to decide.

Our earlier statement that simplicity is largely objective, while elegance and beauty are subjective, is clearly in accordance with these definitions. Simplicity can be defined in terms of Kolmogorov complexity, and hence is sufficiently objective. On the other hand, a simple theory is elegant only if it is easy to handle, easy to use, and this of course depends partly on the person who actually uses it: a complex theory is often easy to handle for those having much experience with it, but difficult for first-year students. Thus elegance is partly subjective. That beauty is also subjective will not surprise us: some people feel pleasure, delight, reverence and wonder very easily, while others remain numb and uninterested even when faced with the theory of relativity. Despite the subjectivity of the elegance or beauty of scientific theories, both are strongly linked to the objectively definable simplicity. In the case of elegance this is immediately apparent from the definition. But beauty, too, is strongly correlated with simplicity; we will see some examples of this in the next chapter. For one thing, in order to appreciate the beauty of some theory, we should be able to comprehend it fully, which can only be if the theory is sufficiently simple for us to be comprehensible in the first place. Extremely complex theories may sometimes be great predictors, but they will generally not be considered very beautiful.

(Even the complexity of works of art is amenable to analysis in terms of Kolmogorov complexity. Literature or musical scores can easily be transformed into bitstrings, which can be assigned a Kolmogorov complexity. Similarly, by treating it as a matrix of colour dots and assigning each dot a number, a painting can be transformed into a bitstring. Very regular paintings will have low complexity; visually very complex paintings will have high complexity.)

The subjectivity of beauty in scientific theories also shows up in the variance of scientists' aesthetic canons over time. To end this section, let us briefly look at James McAllister's interesting recent account of the role of aesthetic criteria in science [McA]. According to this account, empirical criteria for theory choice, such as predictive success and consistency with other theories, are supplemented by aesthetic criteria. McAllister mentions five classes of such aesthetic criteria: symmetry, invocation of a model, visualizability/abstractness, metaphysical allegiance, and, most interesting for us, simplicity. Aesthetic value is projected onto an object (for instance a scientific theory) according to such aesthetic criteria or canons. The content and weighing of these criteria are not constant over time, but are (unconsciously) inductively derived from the properties of recent successful theories: our sense of what is beautiful derives from, and varies with, what is successful. Whenever theories are replaced by more successful ones, our aesthetic canons are to some extent adjusted as well.

In McAllister's model, a scientific revolution occurs if the aesthetic criteria start lagging too far behind the empirical criteria, and these two sets of criteria come into severe conflict. In this case a progressive faction of scientists, which places more value on empirical than on aesthetic criteria, supersedes a conservative faction that wants to hold on to older theories they perceive as more beautiful. This is corroborated by the fact that many of the truly revolutionary new ideas and theories, such as Kepler's elliptical orbits or quantum mechanics, were considered by many rather ugly early after their inception, but are considered much more beautiful and elegant now that we have gotten used to them and have been convinced of their empirical success: our aesthetic canons have been adjusted to them.

(An important difference between McAllister's handling of simplicity and our own: while we ascribe simplicity to representations of scientific theories, namely bitstrings, McAllister ascribes simplicity to those theories themselves, not to particular representations of them [McA]. The fact that we look only at representations avoids the problems associated with McAllister's almost Platonic view of theories as abstract entities. Furthermore, McAllister explicitly rejects using Kolmogorov complexity as the measure of simplicity, since there are alternatives, for instance measuring the number of assumptions or the number of variables in a theory [McA]. Nevertheless, we consider simplicity as measured by Kolmogorov complexity to be more fundamental than other measures, because its foundation, the Turing machine as a model of effective computability, is more fundamental and less ad hoc than others.)

Randomness

Actually, despite our focus on the use of Kolmogorov complexity as a formalization of simplicity, the initial motivation for its development lay elsewhere, namely in the notion of randomness of strings. From the point of view of philosophy of science it is very interesting to specify randomness. After all, scientists aim at finding structure, regularity, causes, etc. in the world, but we cannot rule out a priori that some domains have no regularity whatsoever. Intuitively, if some domain possesses no regularity at all, then descriptions that stem from this domain will be random strings. Accordingly, ways to recognize randomness are important.

Before the advent of Kolmogorov complexity, several attempts had been made to define necessary and sufficient conditions for randomness, for instance by Von Mises, Wald and Church. However, each of those definitions included some strings as random which we intuitively would not consider random, and hence failed. (See [LV] for an overview of the various approaches and why they failed.)

Finite Random Strings

In this subsection we will characterize the property of randomness of finite binary strings; in the next we will deal with infinite strings. It should be noted that when dealing with finite strings, it is rather arbitrary to fix a sharp boundary between random and non-random finite strings: randomness is a matter of degree here. Furthermore, strings are not random per se, but random with respect to a certain probability distribution. For example, a long binary string consisting of about equally many 0s and 1s, distributed in an irregular way over the string, will be fairly random with respect to the probability distribution induced by a fair coin. However, it would be rather surprising if this string were generated by tossing an unfair coin that comes up heads on the vast majority of all tosses, and the string is not random with respect to the distribution induced by the unfair coin.

The way we will define randomness here is via tests for randomness: a string is random if it passes certain tests. For instance, one test for randomness might say that a string that contains far more of one symbol than of the other is not random with respect to the probability distribution induced by a fair coin; another test might say that a string which starts with a very long run of the same symbol is not random with respect to that distribution. Thus we could define a string as random if it passes all conceivable tests for randomness. The theory of tests for randomness described below shows that we can actually formalize this. This framework is due to the Swedish mathematician Martin-Löf [ML], who cooperated with Kolmogorov.


A string is random if it passes all conceivable tests for randomness. But what constitutes a conceivable test for randomness? Firstly, in order for us to be able to carry out such a test, we should in any case be able to compute or approximate its outcome. Secondly, a test for the randomness of a string x with respect to a distribution P is a test of whether x is a typical string for that distribution. For instance, an irregular sequence with roughly as many heads as tails would be a typical outcome for the fair-coin distribution, while a great many times heads in a row would not. The typical strings form a 'reasonable majority' of all strings: a majority on which most of the probability concentrates. Each test is a way of specifying such a majority, and of testing whether a given string x belongs to it or not. A string will be called random if it belongs to every 'reasonable majority' specified in this way, i.e., if it passes each test. We now first give the definition of a test, and then explain what is intended by this idea of a majority.

Definition. Let P be a recursive probability distribution on {0,1}*, the set of finite binary strings. A total function δ: {0,1}* → N is a P-test (or a Martin-Löf test for randomness with respect to P) if it satisfies the following two conditions:

1. δ is enumerable;
2. Σ{ P(x) | δ(x) ≥ m, l(x) = n } ≤ 2^(−m), for every m and n.  ◊

What is going on here? The first condition is fairly plain: δ can only be an implementable or 'effective' test for randomness if we can computably approximate it. The second condition requires some more explanation. The idea here is that the elements of {0,1}* that do not belong to some 'reasonable majority' of typical strings are assigned high values by the test δ. The set V_{m,n} = { x | δ(x) ≥ m, l(x) = n } singles out all strings of length n that are 'special' in having a δ-value of at least m; this set forms the complement of the 'reasonable majority' of typical strings. The second condition in the definition ensures that V_{m,n} is indeed the complement of a reasonable majority, by requiring that V_{m,n} becomes ever more improbable for larger m; equivalently, the probability concentrates on the complement of V_{m,n}, which is to contain the typical strings:

    P(V_{1,n}) = Σ{ P(x) | δ(x) ≥ 1, l(x) = n } ≤ 1/2,
    P(V_{2,n}) = Σ{ P(x) | δ(x) ≥ 2, l(x) = n } ≤ 1/4,
    P(V_{3,n}) = Σ{ P(x) | δ(x) ≥ 3, l(x) = n } ≤ 1/8.

Thus, if we draw an arbitrary x of length n according to P, it is very unlikely that x belongs to V_{m,n} for higher m. If x does belong to V_{m,n}, we have good reason to believe that it is a non-typical string, and we can consider it non-random accordingly. In statistical terms, each V_{m,n} is a critical region: if x ∈ V_{m,n}, then we can reject the hypothesis that x is random with significance level 2^(−m). Suppose, for instance, that we have a string x of length n, and we want to know whether this string is random with respect to a distribution P, using some particular P-test δ, and suppose δ(x) turns out to be some fairly large number m. The second condition in the above definition tells us that the probability that an arbitrary element of length n is a member of V_{m,n} is at most 2^(−m), which for such an m is very small; membership of this set is thus rather improbable. However, our string x is a member of this set, so apparently it is a rather special, non-typical string which does not belong to the 'reasonable majority' tested by δ: we can reject the hypothesis that x is random with high confidence. Thus a P-test δ gives us information about the 'typicalness' of its argument: x tends to be less typical with respect to P if δ(x) is higher, i.e., if x ∈ V_{m,n} for higher m.
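A concrete example may help. Under the uniform ('fair coin') distribution on strings of a fixed length n, the function that counts the leading 0s of a string is a valid P-test in the above sense: the strings with at least m leading 0s carry total probability exactly 2^(−m). The sketch below (my own toy check, not taken from the thesis or from Martin-Löf) verifies the second condition by brute force for a small n.

    from itertools import product

    def leading_zeros(x: str) -> int:
        """delta(x): length of the initial run of 0s in x (computable, hence enumerable)."""
        run = 0
        for bit in x:
            if bit != "0":
                break
            run += 1
        return run

    n = 10
    for m in range(n + 1):
        mass = sum(2 ** -n                      # P(x) under the fair-coin distribution
                   for bits in product("01", repeat=n)
                   if leading_zeros("".join(bits)) >= m)
        assert mass <= 2 ** -m                  # the Martin-Lof condition (here with equality)
    print("condition verified for n =", n)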

Recall that we wanted to define a string as random if it belongs to all reasonable majorities, i.e., if it passes all tests for randomness at a certain confidence level. Now, there are clearly infinitely many different P-tests for any probability distribution P, and it would be rather cumbersome to check whether a string x passes each of those tests before we can assign it the predicate 'random with respect to P'. Instead, it would be much more interesting to have a single P-test which only the random strings pass. It is in fact possible to combine all possible P-tests into a single P-test. Such a test is called universal.

Definition. A P-test δ_0 is universal if for every P-test δ there exists a constant c such that for every string x, δ_0(x) ≥ δ(x) − c. ◊

A universal P-test δ_0 is such that no other P-test can find more than a constant amount of regularity in x beyond what δ_0 finds. It is a fundamental result that for each recursive probability distribution P there is a universal P-test.

Theorem. Let P be a recursive probability distribution, and let δ_1, δ_2, ... be an enumeration of all P-tests. Then δ_0(x) = max{ δ_i(x) − i | i ≥ 1 } is a universal P-test.

The main element in the proof of this theorem (see [LV]) is to show that the sequence δ_1, δ_2, ... is indeed recursively enumerable. Given that fact, it is fairly easy to show that δ_0 as defined here is itself a P-test, and has δ_0(x) ≥ δ_i(x) − i for every i. Hence δ_0 is indeed a universal P-test, and it can be used to measure the degree of randomness of finite strings: x is less random if δ_0(x) is higher.

If we fix some universal P-test δ_0 and some constant c, then randomness of finite strings with respect to P can be defined as follows.

Definition. Let P be a recursive probability distribution and x a finite binary string. Fix some universal P-test δ_0 and a constant c. We call x P-random if δ_0(x) ≤ c. ◊

Infinite Random Strings

In the case of finite binary strings we cannot distinguish sharply between random and non-random strings, but in the case of infinite strings we can. In this subsection we will generalize the previous definitions of tests for randomness to the case of infinite strings. We use B = {0,1}^∞ to denote the set of all infinite binary strings. If ω is some infinite binary string, we use ω_n to denote the finite string consisting of the first n bits of ω; for instance, if ω = 101010..., then ω_4 = 1010. Because we cannot computably deal with probability distributions on the set of infinite strings (no Turing machine can completely 'swallow' an infinite string as input and compute its probability), we have to use something else. We will use a function μ that assigns a number μ(x) between 0 and 1 to any finite binary string x, with the following restrictions:

  1. μ(ε) = 1;
  2. μ(x) = μ(x0) + μ(x1).

Here μ(x) is interpreted as the probability that a string from B starts with x. The first restriction simply says that any string must start with the empty string ε, which is fairly obvious. The second says that the probability that an infinite string starts with x equals the sum of the probabilities that it starts with x0 or with x1, which is obvious as well, because x can only be followed by a 0 or by a 1 (here x0 is the string which is the concatenation of x and 0, and x1 the concatenation of x and 1). Such a function μ is called a measure on B.
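
For instance, the uniform measure λ(x) = 2^(-l(x)), which will reappear below, satisfies both restrictions. The following minimal Python fragment, added here purely as an illustration, checks them on a few strings.

    # A minimal sketch (illustration only): the uniform measure lambda(x) = 2^(-l(x)).
    def uniform_measure(x: str) -> float:
        return 2.0 ** (-len(x))

    assert uniform_measure("") == 1.0                  # mu(empty string) = 1
    for x in ["0", "1", "01", "0110"]:
        # mu(x) = mu(x0) + mu(x1)
        assert uniform_measure(x) == uniform_measure(x + "0") + uniform_measure(x + "1")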

For a recursive measure μ on B, Martin-Löf's definition of a test for randomness of infinite strings can be stated as follows.

Definition: Let μ be a recursive measure on B = {0,1}^∞. A total function δ: B → N ∪ {∞} is a sequential test (or a sequential Martin-Löf test for randomness) if it satisfies the following two conditions:

  1. There exists an enumerable total function γ: {0,1}* → N such that δ(ω) = sup_{n ∈ N} γ(ω_n). (Notational remark: the supremum of a set A of numbers, denoted sup A, is the maximum of A if A contains a greatest element, and ∞ otherwise.)

  2. μ{ω | δ(ω) ≥ m} ≤ 2^(-m), for every m.

Such tests are called sequential because we can use the function γ to approximate δ(ω), by sequentially approximating γ(ω_n) for n = 1, 2, 3, ...

Just as in the case of finite strings, a high value of δ(ω) indicates that ω is non-typical and hence non-random. Let V_m = {ω | δ(ω) ≥ m}. The set V_m contains all strings that are identified as special or non-typical, in the sense of being assigned a value of m or more. A string ω may be said to pass the test δ if it is not special for higher m: ω ∉ V_m for some m (equivalently, ω ∉ ∩_m V_m). An infinite string is random if it passes all tests in this way.

Definition: Let μ be a recursive measure on {0,1}^∞, and for a sequential test δ let V_m = {ω | δ(ω) ≥ m}. An infinite binary string ω is called random if it passes all sequential tests, i.e., if ω ∉ ∩_m V_m for every sequential test δ.

Analogous to the case of finite strings, for each recursive measure μ on the set of infinite strings there is a universal sequential test (see [LV]).

Definition: A sequential test δ_u is universal if for every sequential test δ there exists a constant c, such that for every infinite string ω we have δ_u(ω) ≥ δ(ω) - c.

Theorem: Let μ be a recursive measure and δ_1, δ_2, ... be an enumeration of all sequential tests. Then δ_u(ω) = sup{δ_i(ω) - i | i ≥ 1} is a universal sequential test.

It is not very difficult to see that the infinite strings that are random in the sense of the definition above are exactly the strings ω for which δ_u(ω) is finite. Thus an alternative, equivalent definition of randomness is: ω is random iff δ_u(ω) is finite. Note that if δ_u and δ_v are two distinct universal sequential tests, then |δ_u(ω) - δ_v(ω)| ≤ c for some constant c and for all ω. Hence δ_u(ω) is finite iff δ_v(ω) is finite, which shows that it does not matter which particular test we use: each universal sequential test picks out exactly the same set of random strings.

The above paragraphs defined randomness in terms of tests for randomness. How does all this relate to Kolmogorov complexity? Kolmogorov complexity allows us to formalize another intuition about randomness: a string is non-random to the extent that it contains many regularities. Very vaguely and abstractly, a regularity is some kind of repetition, the recurrence of a certain pattern. Now if a string contains some kind of repetition, then we can make use of this to represent the string more efficiently. For instance, a string of length 10,000 that consists of the string '01' repeated 5,000 times can be represented by the much shorter description '5,000 times 01'. In other words, if a string contains regularities, we can use these to compress the string. Thus a finite string is non-random to the extent that it can be compressed, and is random to the extent that it cannot be compressed. Extending this to the infinite case, an infinite string may be said to be random if all of its prefixes are incompressible, or at least compressible by at most a fixed number of bits. That is, basically: the shortest descriptions of its prefixes are those prefixes themselves, and the shortest description of the whole sequence is that sequence itself.

(Footnote: In the case of finite strings, the definition of K(x) does not enable us to state a clear relation between Kolmogorov complexity and randomness. However, if we temporarily drop the condition that the set of encodings of Turing machines should be prefix-free, and define C(x|l(x)) as the length of a shortest Turing machine that generates x given l(x) on a special input tape, we can state such a relation. Define the randomness deficiency of a finite string x to be f(x) = l(x) - C(x|l(x)). This deficiency is higher if x is more compressible (more regular). A theorem of [LV] states that f(x) is a universal test with respect to the uniform measure. Hence high compressibility indicates non-randomness, and vice versa. Incidentally, since pseudo-random number generators are usually fairly short programs, the 'random' numbers that these programs generate will not really be random at all.)

(Footnote: Dennett [Den] uses Kolmogorov complexity, without mentioning this name, to argue that strings contain 'real patterns' if those strings are compressible.)

The following fundamental result (see [LV]) shows this intuition to be correct, if we consider randomness with respect to the uniform measure λ, which assigns λ(x) = 2^(-l(x)). This is a very natural measure, because it distributes probability in an unbiased, uniform way: all distinct prefixes of length n are equally probable.

Theorem: An infinite binary sequence ω is random iff there is a constant c such that K(ω_n) ≥ n - c, for every n.

An Interesting Random String

In the previous pages we have characterized randomness for finite and infinite strings. As the number of infinite strings is uncountable, whereas the number of Turing machines is only countably infinite, it is clear that there are infinitely many infinite strings that are random with respect to the uniform measure. It is, however, equally clear that we can never effectively describe one. How could we? A random infinite string has no finite effective description. Nevertheless, we can give non-effective descriptions of random strings.

A very interesting example of such a string is the real number Ω. This is defined as follows, where U is some fixed universal Turing machine:

  Ω = Σ{2^(-l(t)) | U(t) halts}.

It can be shown that 0 < Ω < 1, so in binary expansion we can write Ω = 0.ω_1ω_2ω_3..., where each ω_i is a bit. We use Ω_n to denote the string ω_1ω_2...ω_n of the first n bits of Ω's binary expansion.

(Footnote: The first bit in a binary expansion corresponds to 1/2, the second bit to 1/4, the third to 1/8, etc. Thus, for example, the binary number 0.101 corresponds to the decimal 0.625.)

(Footnote: Note that Ω is defined relative to U: distinct universal Turing machines induce distinct Ω's. Furthermore, here we have one of the technical points where the importance of using prefix-free Turing machines shows: if we considered Turing machines encoded in a non-prefix-free way, the sum Σ{2^(-l(t)) | U(t) halts} could go to infinity.)

This very abstract number Ω is called the halting probability, because it is the probability that, when we feed a sequence of fair coin flips into the universal Turing machine U, the machine U halts. This can be seen as follows. Consider a monkey who repeatedly flips a fair coin. If the coin comes up heads we write down a 1, and a 0 otherwise, thus producing a growing binary string. As soon as the binary string x is the encoding of a Turing machine, we let the monkey stop flipping coins. Now consider the infinite sequence of all finite binary strings that encode halting Turing machines:

  x_1, x_2, x_3, ...

For each x_i, the probability that the monkey produces x_i is 2^(-l(x_i)). Since the x_1, x_2, ... sequence is prefix-free, these events are mutually exclusive, so we can add the probabilities: for instance, the probability of producing x_1 or x_2 is 2^(-l(x_1)) + 2^(-l(x_2)). Adding up the probabilities for all these x_i, we get that the total probability that U will halt on the string produced by the monkey is

  Σ_i 2^(-l(x_i)) = Σ{2^(-l(t)) | U(t) halts},

which is our number Ω.

It is quite remarkable that we can actually define a number Ω which gives the probability that an arbitrary Turing machine halts. Apart from this, Ω has some more interesting properties. Firstly, since it is undecidable whether a given Turing machine halts, it follows that Ω is not recursive. It is, however, enumerable, by a procedure outlined below.

The most interesting property of Ω, however, is that knowledge of its first n bits (that is, of Ω_n) enables us to find out which Turing machines of length at most n halt. Suppose we know the number Ω_n, and we want to find out whether some particular Turing machine T of length n halts. Consider a program P which simulates all Turing machines in a dovetailed fashion: first it executes the first step of the first Turing machine, then it executes the first two steps of the first two Turing machines, then it executes the first three steps of the first three Turing machines, and so on. If we run this procedure forever, each Turing machine which halts will be found to halt after a finite number of steps. Suppose this program P maintains a variable ω', initially zero. For each Turing machine of length l that it finds to halt, let our program P add 2^(-l) to ω'. Then during the execution of P, ω' will approach Ω from below. (This procedure can be used to show that Ω is enumerable.) Let P terminate as soon as ω' ≥ 0.Ω_n, where 0.Ω_n is the rational number whose binary expansion consists of the first n bits of Ω. What do we know once P has halted? If T has been found to halt during P's execution, then of course we know T halts. Now suppose T has not halted during the execution of P. Is it possible that T would halt later on? If so, then, since T's contribution of 2^(-n) to the halting probability Ω has not yet been incorporated in ω', we would have Ω ≥ ω' + 2^(-n) ≥ 0.Ω_n + 2^(-n). However, since Ω and 0.Ω_n can only differ in the (n+1)-th and later bits, we have Ω < 0.Ω_n + 2^(-n). Thus, if T does not halt during the execution of P but would halt eventually, then Ω ≥ 0.Ω_n + 2^(-n), which contradicts Ω < 0.Ω_n + 2^(-n). Accordingly, if T halts during the execution of P, then we know it halts, and if it does not halt during the execution of P, then we know it never halts.
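
To make the dovetailing procedure P concrete, here is a minimal Python sketch added for illustration. The list `machines` of (prefix-free) machine encodings and the simulator `runs_within(t, k)`, which reports whether machine t halts within k steps, are assumptions introduced only for this sketch, not real library functions.

    # A minimal sketch (illustration only) of the procedure P described above.
    from fractions import Fraction

    def dovetail(machines, runs_within, omega_n: str):
        """Run until the lower approximation of Omega reaches 0.omega_n;
        return the set of machines found to halt by then."""
        target = Fraction(int(omega_n, 2), 2 ** len(omega_n))   # 0.omega_n as a rational
        approx, halted = Fraction(0), set()
        k = 0
        while approx < target:
            k += 1                                   # stage k: k steps of the first k machines
            for t in machines[:k]:
                if t not in halted and runs_within(t, k):
                    halted.add(t)
                    approx += Fraction(1, 2 ** len(t))
        return halted   # any machine of length <= n not in this set never halts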

Knowledge of whether or not certain Turing machines halt is tantamount to knowing the answers to many mathematical questions. Consider, for instance, Goldbach's famous conjecture, which says that every even number greater than 2 is the sum of two primes. We can easily program a Turing machine T which checks if n is the sum of two prime numbers (which each should be smaller than n), first for n = 4, then for n = 6, then for n = 8, and so on, and which halts if it finds an n which is not the sum of two primes. This Turing machine T halts iff Goldbach's conjecture is false. Suppose T has length at most n bits and somehow we know Ω_n. Using the procedure outlined above, we can find out whether or not T halts, and hence whether or not Goldbach's conjecture is true. And similarly for all other statements that can be encoded in Turing machines of length at most n. Accordingly, if some mathematician had a secret way to access Ω, and sufficient computing power to extract from this whether a given Turing machine halts, he could answer any major question statable in terms of the halting of Turing machines, and become the world's most celebrated mathematician overnight. Unfortunately, the non-computability of Ω tells us that there is no systematic way to access Ω.

(Footnote: To the knowledge of the author, this problem is still open. Fermat's even more famous 'last theorem' (there are no positive natural numbers x, y, z and n ≥ 3 such that x^n + y^n = z^n) is often cited in this context, but this theorem has finally been proved a few years ago.)
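
For illustration, a minimal Python sketch of such a checking program (any real encoding as a Turing machine would of course differ): it searches for a counterexample and halts exactly when Goldbach's conjecture fails.

    # A minimal sketch (illustration only): halts iff Goldbach's conjecture is false.
    def is_prime(p: int) -> bool:
        if p < 2:
            return False
        return all(p % d != 0 for d in range(2, int(p ** 0.5) + 1))

    def goldbach_machine():
        n = 4
        while True:
            if not any(is_prime(a) and is_prime(n - a) for a in range(2, n - 1)):
                return n          # halt: counterexample found, conjecture is false
            n += 2                # otherwise check the next even number, forever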

Furthermore, it can be shown that Ω is a random infinite string (which will become important in the next section): there is a constant c such that

  K(Ω_n) ≥ n - c, for every n.

This can be seen as follows. Given Ω_n, we can compute the set T_n of all Turing machines of length at most n that halt. There clearly exists a finite string x that is not computed by any of the Turing machines in T_n, and hence for which we must have K(x) > n. Now we can define a Turing machine T in which Ω_n is somehow encoded, and which uses those n bits to exhibit one such x. This T requires only K(Ω_n) bits (to encode Ω_n) plus some constant c (to compute an x from Ω_n); here c does not depend on n. Since K(x) > n and T generates x, we must have that

  l(T) = K(Ω_n) + c > n, for every n.

Accordingly, by the theorem at the end of the previous subsection, we know Ω is random, in the sense of passing all sequential tests for randomness with respect to the uniform measure.

The above proof is related to the Richard-Berry paradox, which asks for the following number:

  Let x be the smallest number not definable in less than twenty words.

We have just defined x in less than twenty words, despite the fact that x's own definition rules this out. Analogously, the set of all Turing machines of length at most n bits specifies the set of all strings effectively definable in at most n bits. The Turing machine T mentioned above exhibits a string x that is not definable in n or less bits. If its length l(T) were n bits or less, then we would have effectively defined x in n or less bits, which would turn the Richard-Berry paradox into a formal contradiction. Hence l(T) must be greater than n, from which the non-compressibility of the prefixes of Ω follows.

Gödel's Theorem Without Self-Reference

One further application of Kolmogorov complexity lies in a new proof of Gödel's celebrated first incompleteness theorem [God]. Gödel's theorem shows a fundamental limitation of computers: no single computer can fully capture mathematics, in the sense of being able to prove every true mathematical proposition. The philosophical relevance of this theorem shows up, for instance, in discussions in the philosophy of mind, in particular on theories which take the mind-brain to be something like a computer. Do human beings have the same limitations as computers? People like Roger Penrose [Pen] answer negatively, and conclude that the mind-brain must be something more than a computer. Many others, such as Douglas Hofstadter [Hof], hold on to the idea that the mind-brain is a computer. For them, the importance of Gödel's theorem lies in self-reference, in particular the analogy between the self-referring sentence used in the standard proof of Gödel's theorem and the self-consciousness (the ability to think about ourselves) that is a fundamental aspect of our mind-brain. Hofstadter and others take this notion of self-reference or self-consciousness to be more or less the essence of mind, and consider Gödel's theorem important because it tells us something fundamental about self-reference. What is interesting, however, is that Gödel's theorem can be proved without making use of self-referring sentences at all, thus undermining the analogy with self-consciousness.

The Standard Proof

Let us first spell out Gödel's theorem and its standard proof in some more detail. A full introduction to first-order logic lies beyond the scope of the present thesis; we will only explain the things we need, referring to [BJ] for the many technical details swept under the rug here.

Consider the first-order language of arithmetic, the alphabet of which consists of a constant a, a successor function symbol s, two binary function symbols + and ·, and the binary predicate symbol =, the usual quantifiers ∀ (for all) and ∃ (there exists) and connectives ¬ (not), ∧ (and), ∨ (or), → (if ... then) and ↔ (if and only if), and some variables and interpunction symbols. We assume the reader to be familiar with the usual syntactical rules that specify how terms and formulas are formed from this alphabet. We define a sentence as a formula in which all variables are quantified.

Now consider the interpretation of this language which has the set of natural numbers as domain, which assigns the number 0 to the constant a, the successor function to s, the addition function to +, the multiplication function to ·, and the equality relation over the domain to =. In this interpretation, every term in the language denotes a natural number: a denotes the number 0, s(a) denotes 1, s(s(a)) denotes 2, s(s(a)) + s(s(a)) denotes 4, etc. Furthermore, every sentence in the language has a truth value in this interpretation. For instance, ∀x∃y(y = s(x)) is true, because every number has a successor, while s(a) + s(a) = s(s(s(a))) is false, because 2 does not equal 3. This interpretation of the language of arithmetic is called the Standard Model of arithmetic; in the sequel, when we speak of a 'true' sentence, we mean a sentence that is true in this Standard Model.

Let us denote the set of true sentences by T. For every sentence φ in the language of arithmetic, either φ or ¬φ (but not both) is a member of T. Thus T contains all and only true arithmetical statements. Since large parts of mathematics can be translated into arithmetic, the contents of T are clearly of the utmost importance; in fact, if we had efficient mechanical access to T, many mathematicians would be out of work. How can we get a grip on T? We can of course do what arithmeticians have done for ages: simply try to prove or disprove some interesting statements with hitherto unknown truth values. From a systematic point of view, however, it would be much more satisfactory to have an adequate axiomatization of T. An axiomatization is a recursively enumerable set of sentences. We will call an axiomatization A sound if it implies only true sentences (and hence no false ones), and we call A complete if it implies all true sentences. Clearly, what we are looking for is an axiomatization which is both sound and complete, i.e., which implies exactly the sentences in T, nothing more and nothing less. In case such an axiomatization exists, we would be able to enumerate T: then there is an algorithm which produces ever longer finite prefixes of an infinite list containing all and only formulas from T. If we are interested in the truth value of a sentence φ, we can simply enumerate this list until we encounter φ (in which case φ is true) or ¬φ (then φ is false). Thus, if we have a sound and complete axiomatization of T, we effectively hold every arithmetical truth in our hands.
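
To see this decision procedure spelled out, here is a minimal Python sketch added for illustration; `enumerate_theorems` (yielding the theorems of A one by one) and `negation` are hypothetical helpers assumed only for this sketch.

    # A minimal sketch (illustration only): deciding truth by enumerating theorems.
    def decide(phi, enumerate_theorems, negation):
        for theorem in enumerate_theorems():       # ever longer prefixes of the list
            if theorem == phi:
                return True                        # phi appears: phi is true
            if theorem == negation(phi):
                return False                       # its negation appears: phi is false
        # unreachable for a sound and complete axiomatization:
        # one of the two must eventually be enumerated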

Unfortunately, Gödel's theorem tells us that such an axiomatization does not exist. The proof that Gödel himself gave, which is repeated in some form or other in most logic books, makes use of a self-referring sentence, similar to the liar's paradox, i.e., the mind-boggling sentence 'This sentence is false'. Ignoring many technicalities (for which see [BJ]), this proof runs as follows.

Suppose A is an axiomatization. We will show that A cannot be sound and complete at the same time. We can assign to each sentence in the language its own unique natural number, the Gödel number of that sentence. Similarly, we can assign Gödel numbers to proofs (formal derivations from the axioms of A). Now suppose A is complete. Then we must be able to express each recursive function. In particular, we can construct a formula P(x, y) with two variables x and y, which is true just in case x is the Gödel number of a proof of the sentence of which y is the Gödel number. Informally, P(x, y) means 'x is a proof from A of y'. Thus, for a particular sentence S with Gödel number s, S is provable iff ∃x P(x, s) is true.

Now comes the really interesting and technical part: the diagonal lemma, which we will not prove here (see [BJ]). It says that for any formula G(y) with variable y, there is a sentence L with Gödel number l, such that A implies L ↔ G(l). Suppose we substitute ¬∃x P(x, y) ('y is not provable') for G(y). Then we get that there is a sentence L with Gödel number l, such that A implies L ↔ ¬∃x P(x, l). Thus A implies that L is true iff L is not provable from A. Informally, L can be seen as saying 'I am not provable from A'. In order for A to be sound as well as complete, the true statements should coincide exactly with the provable ones. But if L is true, then L is not provable, and if L is provable, then L is not true. Hence A cannot be both sound and complete at the same time.

(Footnote: We assume some complete proof procedure is used, so the set of sentences provable from A is exactly the set of sentences that are logically implied by A. That such complete proof procedures exist is Gödel's completeness theorem from 1930.)

A Proof Without Self-Reference

In this subsection we will use Kolmogorov complexity to outline an alternative proof of Gödel's theorem, due to Chaitin [Cha], which does not require the construction of a self-referring sentence. Apparently, though self-reference can be used to establish Gödel's theorem, it is not necessary for its proof. This sheds new light on the many cases where Gödel's theorem is invoked in discussions about self-reference and self-consciousness.

So how can we prove Gödel's theorem without self-reference? By making use of the incompressibility of the number Ω we saw earlier. We can define a formula O(n, y) with two variables n and y, which is true iff y = Ω_n, i.e., y is the first n bits of Ω (again we leave out the technical details of the definition). Thus exactly the following instances of O(n, y) are true:

  O(1, Ω_1)
  O(2, Ω_2)
  O(3, Ω_3)
  ...

In order to improve readability, we have written 1 instead of the more correct s(a) in the above formulas, 2 instead of s(s(a)), etc. Now we can show that for every sound axiomatization A, there exists a number k such that A cannot imply any true sentence of the form O(n, y) for n > k. In other words, the prefixes of Ω longer than k bits are beyond the grasp of A. From this, Gödel's theorem immediately follows, since each of the formulas

  O(k+1, Ω_{k+1})
  O(k+2, Ω_{k+2})
  O(k+3, Ω_{k+3})
  ...

will be true but unprovable.

Given a sound A, how can we prove the existence of such a k? Suppose such a k does not exist. Then for every k there is an n > k such that A implies O(n, Ω_n). There exists a Turing machine which enumerates all logical consequences of A, including O(n, Ω_n). Let T_n be the shortest such Turing machine, and b_n be its length. Then we can construct from T_n a Turing machine T'_n that generates Ω_n: this T'_n simply enumerates all logical consequences of A using T_n, until it finds some O(n, y), and then outputs y (which must be Ω_n). The length of T'_n will be at most b_n + log n + d, where d is some constant independent of n (the log n term is due to the fact that T'_n must know n, in order to know what it is looking for). Since T'_n generates Ω_n, we must have

  K(Ω_n) ≤ b_n + log n + d.

However, recall from the previous section that, because Ω is random, there is a c such that

  K(Ω_n) ≥ n - c, for every n.

Now choose a sufficiently large m such that b_m + log m + d < m - c and A implies O(m, Ω_m). We must have K(Ω_m) ≤ b_m + log m + d < m - c because of the first inequality, and K(Ω_m) ≥ m - c because of the second, which is a contradiction. Hence there must be a k such that A does not imply any of the true formulas

  O(k+1, Ω_{k+1})
  O(k+2, Ω_{k+2})
  O(k+3, Ω_{k+3})
  ...

Note that while the standard proof provides us with only one example of a true but unprovable formula (namely L, 'I am unprovable from A'), the above proof gives infinitely many such formulas.

In a nutshell: any axiomatization of arithmetic will contain only a finite amount of information, and hence cannot capture all truths about the infinite, incompressible string Ω.

Randomness in Mathematics

Because Ω is random, it contains no regularities whatsoever (at least, no algorithmically detectable regularities): each of its infinitely many bits is, in a way, independent of the other bits. This means that the only way to know these bits is to know them explicitly; no formal system is able to unveil each of them for us. From this, as we saw, Gödel's theorem immediately follows. Our inability to fully capture Ω in a formal system, no matter how complicated, shows a clear limit of formal mathematical reasoning.

Now, Ω is a rather outlandish real number, and one may wonder whether mathematicians in general should be much bothered by its existence. However, the number Ω can be translated to the most elementary part of mathematics: elementary number theory. Chaitin has constructed an exponential Diophantine equation E_n (an equation built up from non-negative integer variables and constants, and a finite number of additions, multiplications and exponentiations), with one parameter n, such that E_n has finitely many solutions if the n-th bit of Ω is 0, and E_n has infinitely many solutions if this bit is 1 (see [LV]). Since Ω is incompressible, each of its bits is independent of the others, and hence the existence of finitely or infinitely many solutions of E_n for some particular n is, as it were, a 'brute fact'. This is a very strong form of Gödel's theorem: formal systems are not even powerful enough to fully capture this single parameterized equation. Chaitin draws rather strong conclusions from this fact:

we see that proving whether particular exponential Diophantine equations have finitely or infinitely many solutions is absolutely intractable. Such questions escape the power of mathematical reasoning. This is a region in which mathematical truth has no discernible structure or pattern and appears to be completely random. These questions are beyond the power of human reasoning. Mathematics cannot deal with them. Nonlinear dynamics and quantum mechanics have shown that there is randomness in nature. I believe that we have demonstrated in this book that randomness is already present in pure mathematics, in fact, even in rather elementary branches of number theory. [Cha]

Here Chaitin's rhetorical step from 'no single formal system can deal with all solutions to this equation' to 'mathematics cannot deal with them' seems a rather hasty one, since no one has a precise definition of what 'mathematics' is. It certainly seems that mathematics, in the vague sense of the community of mathematicians and their work, cannot be equated with some single formal system. Moreover, we can agree with Van Lambalgen that the analogy between Ω's randomness and physical randomness is rather far-fetched [Lam].

Still, the intractability of even a single equation clashes with the usual way we look at mathematics as a supremely reasonable and reasoned science, where the true facts are true for a reason. Chaitin even goes so far as to suggest that the existence of such randomness should change the way mathematicians work. Rather than formulating axiomatic systems and trying to prove conjectures within these systems, Chaitin argues, mathematicians should work much more in the way of physics: if some conjecture seems plausible but you are not able to prove it (and by Gödel's theorem it may actually be true and unprovable at the same time), you may tentatively accept it as a working hypothesis, as a new axiom, experiment with it, and see what happens. Indeed, this pragmatic stance (despite being in strong contradiction with the 'Olympian' image many people hold of mathematics) seems to be forced upon parts of mathematics where certain central conjectures turn out to be extremely hard to prove or disprove. An example is the adoption of P ≠ NP as a working hypothesis in complexity theory (see the earlier footnote on this).

Summary

The Kolmogorov complexity K(x) of a string x is the length of its shortest effective description, that is, the length of a shortest Turing machine that generates x, relative to some fixed universal Turing machine U. K(x) is independent (up to a fixed constant) of the choice of U, it is non-computable, and it converges to Shannon's measure of information content. It allows us to give an objective formalization of the notion of simplicity: a string (theory, description) is simple if it has low Kolmogorov complexity. Furthermore, the random infinite binary strings can be identified exactly as those strings whose prefixes are incompressible. Finally, Kolmogorov complexity enables us to show that Gödel's theorem can be proved without self-referring sentences.

Chapter

Occam's Razor

Introduction

Occam's Razor is one of the most important principles of method in science. In its best known formulation, it says that 'entities are not to be multiplied beyond necessity'. This means that if we can explain some phenomenon in two ways, the first postulating fewer entities than the second, then we should choose the first. Somewhat more liberally, without focusing on entities, we may say that Occam's Razor tells us that among the theories, hypotheses, or explanations that are consistent with the facts, we are to prefer simpler over more complex ones. In other words, we should weed out all unnecessary complexity. Other names for this are the principle of simplicity or the principle of parsimony.

How exactly are we to interpret this principle? We can discern at least three different interpretations of the razor:

1. Methodological. In this interpretation, selecting simple theories is simply a part of the scientific method, perhaps because simpler theories are easier to work with.

2. Ontological (or metaphysical). In this interpretation, Occam's Razor is a statement about the world, akin to the laws of physics: it says that the world itself is relatively simple and well-organized, and preferring simple theories is advisable precisely because the world itself is simple.

3. Aesthetical. In this third interpretation, beauty is taken to be a positive indicator of truth: simple theories tend to be beautiful, and beautiful theories tend to be true, so selecting simple theories is a good thing.

(Footnote: 'Beauty is truth, truth beauty, that is all / Ye know on earth, and all ye need to know.' John Keats, Ode on a Grecian Urn.)

Note that the second and third of these are stronger than the first: if the world itself is simple (ontological interpretation), or if simplicity indicates truth (aesthetical interpretation), then it is clearly good method to favour simple theories (methodological interpretation). The converse need not hold: the world may be complex and the right theories may be rather ugly, while favouring simple theories may still be good method, because simple theories are easiest for us to deal with.

Occam's Razor plays a prominent role in science, as well as in philosophy of science and philosophy in general; the next section will illustrate this with a number of examples. However, most of those who do the shaving seem to accept the razor more or less as an article of faith, as something intuitively acceptable which need not (or cannot) be proved itself. This is a somewhat odd stance. After all, the simplest theory consistent with the available data is not guaranteed to be the right one, and indeed, in many cases it will be wrong. For example, there once were times when circular planetary orbits were consistent with the data available at that time. Would it not be simpler if the planets moved in circles rather than approximate ellipses? Still, they don't. Or, as a second example, consider the overly simple early model of the atomic nucleus: 'one of the most striking examples of how physicists can temporarily be lead astray by the selection of complexes from nature on grounds of simplicity. The case in point is the model of the nucleus built of protons and electrons' [Pai]. So following Occam's Razor in these two cases would lead to false results.

Despite the fact that the razor may yield incorrect results in particular cases, it still remains a quite successful guiding principle in science and elsewhere. Accordingly, an important question is: what are the justifications for Occam's Razor? Of course, we can give a fairly shallow inductive argument to the effect that the razor (in each of its three interpretations) has worked quite well in the past, and hence will probably keep doing so in the future. This, however, is rather circular: since considerations of simplicity (i.e., Occam's Razor itself) are crucial for induction, we cannot simply use an inductive argument to justify Occam's Razor.

Quine is one of the few who have looked a little closer at why we prefer simple theories. His conclusions:

We have noticed four causes for supposing that the simpler hypothesis stands the better chance of confirmation. There is wishful thinking. There is a perceptual bias that slants the data in favor of simple patterns. There is a bias in the experimental criteria of concepts, whereby the simpler of two hypotheses is sometimes opened to confirmation while its alternative is left inaccessible. And finally there is a preferential system of scorekeeping, which tolerates wider deviations the simpler the hypothesis [for instance, changing the simple value ... to ... is more likely to be seen as a refinement rather than a refutation of an hypothesis, than a change from ... to ... (RdW)]. These last two of the four causes operate far more widely, I suspect, than appears on the surface. Do they operate widely enough to account in full for the crucial role that simplicity plays in scientific method? [Qui]

Quine leaves the last question as a real question. But his conclusions are not very encouraging as they stand, since they consider the success of simplicity

(Occam's Razor) to be mainly an artifact of the way we perceive and the way we do science.

However, we can actually do much better than that: in certain formal settings, we can more or less prove that certain versions of Occam's Razor work. This is the topic of the present chapter, which is organized as follows. In the next section we start with a brief historical overview of Occam's Razor and some examples of its application in science and philosophy. Then, in the three ensuing sections, we provide three different formal justifications of Occam's principle. In each of these, simplicity is measured by means of Kolmogorov complexity. Finally, in the last section of this chapter, we briefly look at what the theory of Kolmogorov complexity has to say on the very possibility of science.

Occam's Razor

In this section we will first make some historical remarks on the razor, and then describe some examples of its influence in science and philosophy.

History

An appropriate place to start the history of Occam's Razor is, of course, William of Occam himself (also sometimes spelled 'Ockham'). Occam was an English theologian who lived from the late thirteenth to the mid-fourteenth century. Despite his rather exciting life, involving conflicts with the Pope and others, and his relatively modern philosophy, Occam is nowadays mainly remembered for his razor. This is often cited as 'Entia non sunt multiplicanda sine necessitate' (entities should not be multiplied without necessity: don't postulate more things than you really need). Notoriously, however, this formulation has not been found anywhere in Occam's actual writings, as was noted by Thorburn [Tho]. Some similar formulations which can be found in Occam's work are the following (these and others are cited in [Der]):

'Pluralitas non est ponenda sine necessitate.' (A plurality should not be postulated without necessity.)

'Nulla pluralitas est ponenda nisi per rationem vel experientiam vel auctoritatem illius, qui non potest falli nec errare potest convivi.' (A plurality should only be postulated if there is some good reason, experience, or infallible authority for it.)

'Frustra fit per plura quod potest fieri per pauciora.' (It is vain to do with more what can be done with less.)

However, formulations like these were not originally introduced by Occam himself: similar ones can be found in the writings of Occam's teacher Duns Scotus, as well as in many other medieval philosophers and theologians. Moreover, as Wil Derkse shows, Occam's Razor might as well be called 'Aristotle's Razor',

because each of its various aspects (the methodological, the ontological, and the aesthetical) can already be found in texts of Aristotle.

(Footnote: As far as we know, the term 'razor' was not used by Occam himself either. The first known to use the word (the French 'rasoir') in this connection was Pierre Bayle, in the index of his Pensées (see [Der]).)

(Footnote: See [Boa] and [Der, Chapter II] for numerous illuminating citations from Aristotle's work.)

To what extent is it still appropriate to call Occam's Razor Occam's? From the above quotations from his work, it is fairly obvious that Occam accepted the razor in its methodological interpretation. On the other hand, the aesthetical aspect seems to be missing [Der]. It remains to examine to what extent Occam accepted the ontological interpretation of the razor. Admittedly, Occam is well-known for his parsimonious nominalist ontology. In particular, he denied separate existence to various kinds of abstractions, such as species, relations, causation, motion, etc. This is usually taken to be the result of his razor. However, one might question whether the sparseness of Occam's ontology is due to an application of the razor, or to other reasons. In fact, Roger Ariew argues for the latter view: for Occam the basic principle is not parsimony, but absolute divine omnipotence. Now suppose, for instance, that relations could have an existence separate from the relata. Since God is omnipotent, this would imply that He could create a relation without creating the relata. But that is absurd, so relations cannot be separate things [Ari]. In general, for Occam, God's omnipotence bars the ontological assumption that the world itself is simple, since God can (and sometimes does) make it more complex than strictly necessary:

God does many things by means of more, which He could have done by means of fewer, simply because He wishes it. No other cause must be sought for, and from the very fact that God wishes, He wishes in a suitable way, and not vainly. (Occam, as cited in [Ari])

Thus, for Occam, the razor is a methodological but not an ontological principle. 'The razor must be viewed as a restriction on men, not on God or any of its works' [Ari]. Weinberg reaches the same conclusion: 'For Ockham there is a principle of Parsimony which applies to human thought, not to the universe' [Wei].

Applications in Science

Here we will illustrate the influence of Occam's Razor in science with some examples.

The starting point of modern science is usually taken to be the work of Copernicus, Kepler and Galileo, which put cosmology on its present heliocentric feet. Each of these men had a strong preference for simplicity, and can be regarded as an adherent of the ontological interpretation of Occam's Razor: nature itself is simple and economical. For instance, Copernicus writes:

Attacking a problem obviously difficult and almost inexplicable, at length I hit upon a solution whereby this could be reached by fewer and much more convenient constructions than had been handed

down of old, if certain assumptions (which are called axioms) be granted me. (i.e., Copernicus claims his heliocentric model requires far fewer epicycles than Ptolemy's geocentric one; Copernicus, as cited in [Bur])

Many historians and philosophers of science repeat the claim that Copernicus' system is much simpler than Ptolemy's, for instance [Bur], stating that Ptolemy requires some 80 epicycles while Copernicus needs only 34. However, much like the usual wording of Occam's Razor, this is a myth which has to be taken cum grano salis, in fact with the whole salt-cellar [Coh]. Copernicus' epicycles are not quite the same as Ptolemy's, and there are many different ways of counting and comparing the number of cycles each requires. On quite a few of those counts, Copernicus' system comes out more complex than Ptolemy's [Coh].

(Footnote: Copernicus himself claimed greater simplicity for his system in his early Commentariolus, but no longer in his later and more famous De revolutionibus orbium coelestium.)

(Footnote: And his system wasn't more empirically accurate than Ptolemy's either [Coh].)

The real simplification actually only occurred somewhat later, when Kepler dropped Copernicus' assumption that planetary orbits are basically circular (with some additional epicycles thrown in to get empirical adequacy), in favour of ellipsoid orbits. Like Copernicus, Kepler explicitly endorsed simplicity:

'Natura simplicitatem amat.' (Nature loves simplicity.)

'Natura semper, quod potest per faciliora, non agit per ambages difficiles.' (Nature does not use difficult roundabout ways to do what can be done with simpler methods.)

(Kepler, as cited in [Bur])

And finally, Galileo:

Nature doth not that by many things, which may be done by few. (Galileo, Dialogues Concerning the Two Great Systems of the World, Salusbury translation, London, as cited in [Bur])

When therefore I observe a stone initially at rest falling from a considerable height and gradually acquiring new increments of speed, why should I not believe that such increases come about in the simplest, the most plausible way? On close scrutiny we shall find that no increase is simpler than that which occurs in always equal amounts. (i.e., gravitation causes constant acceleration; Galileo, as cited in [Der])

Isaac Newton hugely contributed to the simplification of physics by bringing very different phenomena (falling bodies, the tides, the movements of the planets, etc.) under the same relatively simple system of general laws, inventing the required mathematics along the way. He, like his predecessors, was explicitly guided by a preference for simplicity. In the third edition of his Philosophiae

Naturalis Principia Mathematica, he included the following as the first of the 'Regulae Philosophandi':

We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. To this purpose the philosophers say that nature does nothing in vain, and more is in vain when less will serve; for nature is pleased with simplicity, and affects not the pomp of superfluous causes. (Newton, Principles II, as cited in [Bur])

This rule of Newton's is just another formulation of Occam's Razor.

Apart from Newton, Albert Einstein is probably the most important physicist of all times. Not only is he arguably this century's greatest scientist, and certainly the most idolized one, he also was one of those who were most influenced by simplicity. As Derkse writes: 'Albert Einstein can be considered as a paradigmatical example of a great scientist who valued simplicity heuristically, sought for simplicity methodologically, and believed in simplicity ontologically' [Der]. This is a recurring theme in Abraham Pais' superb biography of Einstein. In fact, Pais concludes that simplicity was the most important driving force behind Einstein's development of relativity:

Einstein was driven to the special theory of relativity mostly by aesthetic arguments, that is, arguments of simplicity. This same magnificent obsession would stay with him for the rest of his life. It was to lead him to his greatest achievement, general relativity, and to his noble failure, unified field theory. [Pai]

(Footnote: Notice that Pais seems to identify aesthetic arguments with arguments of simplicity. As far as science is concerned, he is probably to a very large extent in the right. However, in other areas, such as art, equating aesthetics and simplicity would be too simple.)

This is corroborated by many of Einstein's own writings, particularly those from his later years, for instance:

I do not consider the main significance of the general theory of relativity to be the prediction of some tiny observable effects, but rather the simplicity of its foundations and its consistency. (Einstein, as cited in [Pai])

As with Newton and the others, the preference for simplicity is not merely a methodological or heuristical tool, but is taken to correspond to simplicity in the world itself:

In my opinion there is the correct path and it is in our power to find it. Our experience up to date justifies us in feeling sure that in nature is actualized the ideal of mathematical simplicity. (Einstein, as cited in [Pai])

Apart from adopting the ontological form of the razor, Einstein and many other scientists also followed its aesthetical interpretation. For instance, Paul Dirac writes:

It is more important to have beauty in one's equations than to have them fit experiment. It seems that if one is working from the point of view of getting beauty in one's equations, and if one has really a sound insight, one is on a sure line of progress. ([Dir], as cited in [McA])

We may conclude that some of the greatest advances in modern science (Copernicus' heliocentric model, Newton's theory of gravitation, and Einstein's theory of relativity) were highly influenced by the value those scientists themselves bestowed on simplicity, as a virtue of theories and as an indicator of the truth of those theories. In sum, many of the greatest scientists held Occam's Razor in its ontological as well as in its aesthetical form, believing in the simplicity of the world and the truth-indicating properties of beauty.

Applications in Philosophy

Whether or not philosophy is continuous with science, and what exactly this might mean, is a debatable issue, which we shall not go into here. One important feature that philosophy shares with science is the importance it places on simplicity as a desirable feature of theories and explanations, and, accordingly, on Occam's Razor. Particularly in our century, many a philosopher invested a lot of time and effort in Occam's Razor. For instance, Russell used the razor so extensively that Passmore calls it 'his main philosophical occupation' [Pas]. In this subsection we will mention some applications of Occam's Razor (mainly in its ontological form) in philosophy.

The first and foremost application in philosophy lies in the philosophy of science: Occam's Razor closes the gap between observation and theory. What is this gap, how does it arise, and how can it be closed? The available data (observations, experiments, etc.) constitute the raw material on which a scientific theory is to be built. However, usually more than one theory is consistent with the available data: the data underdetermine the theory. That is, usually we can construct many (mutually inconsistent) theories to explain the same set of observational data, and the data do not provide us with further criteria to choose between these various theories. Thus the data leave open a lot of possible choices. Now, Occam's Razor closes this gap by stating that of all theories consistent with the data, we are to choose the simplest one. Accordingly, if we accept Occam's Razor in some precise form (where simplicity is measurable), there is hardly any underdetermination left. Informally, we might even say that the conjunction of the observational data and Occam's Razor entails which theory we are to adopt. Thus Occam's Razor may be seen as one of the cornerstones of science.

(Footnote: The only case where some underdetermination remains is where 'the simplest theory' does not exist, i.e., where several among the consistent theories share the lowest complexity.)

Our second example concerns logical positivism, the logico-scientifically minded movement that sprung from the Wiener Kreis and included people like Schlick, Carnap and Reichenbach. Following the lead of Wittgenstein's Tractatus ('Wird ein Zeichen nicht gebraucht, so ist es bedeutungslos. Das ist

der Sinn der Devise Occams', that is: if a sign is not used, it is meaningless; that is the point of Occam's maxim), they attempted to cut away metaphysics by showing its propositions to be meaningless. If some term or proposition has no testable links to our experiential world, they contended, it can only be used in empty babble, but not in any proper, scientifically relevant way. Accordingly, we can do without it, and by Occam's Razor, we should do without it. Since the logical positivists thought they could show all metaphysical terms (such as 'God', 'spirit', 'Ding an sich') to be without testable empirical content, this allowed them to cleanly shave away all of metaphysics.

(Footnote: A similar attitude also appears in the later Wittgenstein, for instance: 'Hier möchte ich sagen: das Rad gehört nicht zur Maschine, das man drehen kann, ohne dass Anderes sich mitbewegt.' [Wit] That is: a wheel that can be turned though nothing else moves with it is not part of the mechanism.)

Our third example is from the philosophy of mind. Here Occam's Razor is used to argue against dualism, the position which says that the physical and the mental are different kinds of 'stuff'. Are statements like 'I am in pain' about something irreducibly mental, or are they actually about material brain processes? A number of materialists (mainly Australian ones, for some reason) have argued for the latter [Pla, Sma]. Why? Mainly because of Occam's Razor [Sma]. The Australian strategy is to show that materialism, which says that the mental is identical to a brain process, is consistent with the facts and much simpler than competing positions like dualism. After all, materialism requires only one kind of 'stuff', while dualism requires two, which, moreover, do not seem to fit together very well. An application of Occam's Razor then suffices to eliminate the competition, leaving materialism as the glorious winner. We can see Occam's Razor at work in the following quote from the notable materialist Jack Smart:

If it be agreed that there are no cogent philosophical arguments which force us into accepting dualism, and if the brain process theory and dualism are equally consistent with the facts, then the principles of parsimony and simplicity [Occam's Razor] seem to me to decide overwhelmingly in favour of the brain-process theory. [Sma]

If Occam's Razor helps us decide between dualism and materialism, it is a very sharp and powerful razor indeed.

A fourth example is in the field of ontology. Here trope theory has fairly recently been put forward as an alternative to the ancient substance-property ontology [Wil, Cam, Bac]. According to the latter, there are two separate ontological categories: the world consists of things (substances) which have properties. Trope theory, on the other hand, argues that there is only one category, namely tropes. A trope is, in a way, the intersection of a substance and a property: it is a particular property of a particular thing, like the greenness of this particular pea. According to trope theory, tropes are the only ontological category: the world consists of tropes, all tropes, and nothing but tropes. Given that one ontological category is simpler than two, Occam's Razor induces us to favour trope theory over the substance-property ontology.

(Footnote: However, it should be noted that this example differs somewhat from the others. Namely, the choice of a substance-property versus a trope ontology is more like a choice of conceptual scheme or language than a choice between competing empirical hypotheses.)
Our fifth and final example uses Occam's Razor as an argument for atheism, or at least for agnosticism (both atheism and agnosticism would horrify Occam himself). Basically, the argument is that science is sufficient to explain the phenomena, and hence there is no need to invoke the existence of a God, in some form or other, as an additional explanatory factor. But if we do not need God in our explanations, Occam's Razor tells us that we should not postulate His existence, thus making us either an atheist (if we postulate His non-existence) or an agnostic (if we suspend judgment as to His existence).

As a final comment on the application of Occam's Razor, it should be noted that such 'razing' arguments are not conclusive, in the sense that they do not establish something beyond any doubt. In each of the five above examples, Occam's Razor provides one argument, one good reason for the simpler position, but not a conclusive proof. After all, Occam's Razor cannot be conclusive always, because sometimes the simplest explanation or hypothesis simply is not the right one, and turns out to be false later on. Nevertheless, arguing for some position on the basis of Occam's Razor is not futile or empty either, because, as we will see, one can establish a positive correlation (though clearly not a perfect one) between simplicity on the one hand, and truth or adequacy on the other. Thus, given competing positions, it still makes sense to favour the simpler.

Occam and PAC Learning

The examples given in the previous section are probably sufficient to convince the reader of the central place Occam's Razor holds in science and philosophy. In this and the following two sections, we will see how we can mathematically formalize and prove versions of Occam's Razor. Each of these three versions is closer to the methodological than to the ontological interpretation of the razor, though they also capture the aesthetical interpretation, to the extent that simplicity can be equated with beauty. Clearly, in order to formalize Occam's Razor, we need some quantitative measure of simplicity. Not surprisingly, Kolmogorov complexity will serve this role in each of the three settings. The first of these, the topic of the present section, deals with PAC learning.

Occam's Razor states that the simplest consistent hypothesis should be selected. However, in many cases it is computationally rather costly to really find the simplest hypothesis, while it is often much easier to find a relatively simple hypothesis. For instance, given two finite sets S and T of sentences known to be grammatical and ungrammatical, respectively, it is not very difficult to find a Deterministic Finite Automaton (or a regular grammar) which generates a language L that contains S and is disjoint from T, whereas finding the smallest such DFA is known to be NP-complete, and hence probably not efficiently solvable [GJ]. Accordingly, we will weaken Occam's Razor somewhat, to the following:

Selecting relatively simple hypotheses is a good strategy.

Below we give a theorem which can be seen as a formal counterpart to this version of the razor, within the PAC learning framework. Here we will assume some representation R, without explicitly mentioning R each time. Furthermore, we assume R as well as the domain have a binary alphabet {0,1}, so all names of concepts and all examples are binary strings.

An Occam algorithm is a learning algorithm that follows Occam's preferred strategy: it reads examples and outputs a consistent hypothesis that is significantly simpler than those examples. The main result of this section will be that an efficient Occam algorithm is an efficient PAC learning algorithm. Informally, this means that if we follow Occam's Razor, then we thereby automatically learn probably approximately correctly.

There exist several subtly differing versions of the 'Occam's Razor theorem'. It was originally proved by Blumer, Ehrenfeucht, Haussler and Warmuth [BEHW], but see also [AB], [Nat], [KV], and [LV]. Most of these results measure the simplicity of a hypothesis by the length of the name the learning algorithm outputs; [LV] is the only one that uses the more sophisticated approach of measuring simplicity by the Kolmogorov complexity of that name. Unfortunately, the PAC framework used there is somewhat simpler than the one adopted by us; in particular, it uses only a single error parameter and there is no length parameter n. Accordingly, we will have to prove a version of our own, adapting and combining some definitions and proofs that can be found in the literature.

Formally, an Occam algorithm is defined as an algorithm that, when given a set of examples, outputs the name of a concept consistent with those examples, with small Kolmogorov complexity.

Definition: Let F be a concept class. A learning algorithm L for F is called an Occam algorithm if there exist constants α ≥ 1 and 0 ≤ β < 1, such that whenever L is given a set S of m examples for a target concept f, L outputs a name r of a concept g satisfying:

  1. g is consistent with S;
  2. K(r) ≤ (mn)^β · (l_min(f))^α, where n is the length parameter.

Moreover, L is a polynomial-time Occam algorithm if also:

  3. There is a polynomial p(n, m) such that, given length parameter n and m examples, L needs at most p(n, m) steps.

The interesting and perhaps puzzling part of this definition is the condition $K(r) \leq (mn)^{\beta} \, l_{min}(f)^{\alpha}$. It says that the Kolmogorov complexity of the output should be bounded by a polynomial in mn (the number of examples times the maximal length of examples) and $l_{min}(f)$ (the shortest name of the target concept f). If we have a set S of m examples, each of length at most n bits, then we can often simply construct a consistent hypothesis by recording the given examples and their labels explicitly, and assigning all other (unseen) elements of the domain the label 0. Since each example is at most n bits, recording m examples and their labels takes roughly mn bits. This concept represents the examples explicitly, but does not really learn anything. However, because $\beta < 1$ and $l_{min}(f)$ and n are fixed, for sufficiently large m we get $(mn)^{\beta} \, l_{min}(f)^{\alpha} < mn$, so an Occam algorithm will have to find a g which is simpler than the concept that simply records the given examples. Accordingly, for sufficiently large m, an Occam algorithm will have to compress the data: the output concept should be simpler than the examples themselves.
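To make this contrast concrete, here is a small illustrative sketch (an assumption-laden toy, not a construction from the thesis): it uses the length of a zlib-compressed string as a crude, computable stand-in for the uncomputable Kolmogorov complexity K, and compares a "hypothesis" that merely memorizes the m examples (on the order of mn bits) with a short rule for a hypothetical target concept ("the label is the first bit"). All concrete names and numbers are chosen for illustration only.

```python
# A minimal sketch, assuming zlib-compressed length as a rough stand-in for K
# (the real K is not computable). Not the thesis's construction.
import random
import zlib

def approx_K(s: str) -> int:
    """Bits of the zlib-compressed string: a crude upper-bound proxy for K(s)."""
    return 8 * len(zlib.compress(s.encode()))

random.seed(0)
m, n = 200, 20
# Hypothetical target concept: strings whose first bit is 1.
examples = ["".join(random.choice("01") for _ in range(n)) for _ in range(m)]
labels = ["1" if x[0] == "1" else "0" for x in examples]

# "Memorizing" hypothesis: store every example with its label (roughly m*n bits).
memorizer = "".join(x + y for x, y in zip(examples, labels))
# Genuinely short hypothesis: a constant-size rule, independent of m.
short_rule = "label = first bit of the example"

print("m*n bits:             ", m * n)
print("K-proxy of memorizer: ", approx_K(memorizer))   # grows roughly with m*n
print("K-proxy of short rule:", approx_K(short_rule))  # stays constant as m grows
```

As m grows, the memorizing hypothesis keeps growing roughly linearly with mn, while the short rule stays constant; the definition above demands that, for large m, an Occam algorithm's output behaves like the latter rather than the former.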

The following theorem shows that Occam's Razor really works, in the sense that an Occam algorithm can meet the requirements of a PAC algorithm. In order not to impair readability, we have deferred the (very technical) proof to the appendix of this chapter.

Theorem. Let F be a concept class. If there is a polynomial-time Occam algorithm for F, then F is polynomial-time PAC learnable.

Cutting through the above notation, the result says the following: if we can follow Occam's Razor, in the sense of having an efficient algorithm that selects short hypotheses, then this algorithm is automatically guaranteed to learn probably approximately correctly. In fewer words: efficiently selecting short hypotheses is a good strategy, and we can formally prove this in the PAC framework.

It is important to note carefully what has and has not been proved here. What has been proved is that if we can follow Occam's Razor, in the sense of having an efficient Occam algorithm, then this algorithm will be a good (i.e., PAC) hypothesis selector. What has not been proved is that we can always follow Occam's Razor. One can easily think of cases (for instance, if the domain X consists of all finite binary strings and as concept class we have $F = 2^X$) where a sufficiently long given sequence of m examples, each of length n, has a Kolmogorov complexity of about mn. In such cases, compression of the examples to within the bound $K(r) \leq (mn)^{\beta} \, l_{min}(f)^{\alpha} < mn$ will not always be possible: no Occam algorithm will exist in this case. In sum: we can prove that following Occam's Razor is a good thing, but in some cases it may not be possible to implement the razor.

Occam and Minimum Description Length

Of the three formalizations of Occam's Razor discussed in this chapter, the Minimum Description Length (MDL) principle probably comes closest to our informal understanding of Occam's Razor. It was invented by Jorma Rissanen [Ris], motivated by Ray Solomonoff's work (to be described in the next section), and tells us to select the theory which most compresses the data:

There actually exist some weak converses to related forms of this theorem (see [KV] and [Nat]), which show that under certain conditions classes that are efficiently learnable are also learnable using Occam algorithms. Unfortunately, we have not been able to prove an exact converse to our present version of the theorem.


Given a sample of data and an effective enumeration of the appropriate alternative theories, the best theory is the one that minimizes the sum of:

1. the length, in bits, of the description of the theory;
2. the length, in bits, of the data when encoded with the help of the theory.

The idea here is that we should balance between putting too much and too little detail in the theory. In the most likely cases, the data contain a more or less regular pattern, with some additional noise (irregularities due to measuring errors, etc.). If we tried to include the noise in the theory, the description of the theory would get rather large and detailed, while the length of the data encoded with the help of the theory would get smaller, since the data can then already be reconstructed largely from the theory itself. However, in this case we would probably overfit the data, i.e., we would lose predictive power by focusing too much on the accidental, noisy details of the given data. On the other hand, we can minimize the length of the theory by including all the details of the data in the second part. But in this case we would have a virtually empty theory, and again no predictive power. The right approach lies somewhere in the middle, and achieves predictive power by balancing the descriptions of the theory and data, by minimizing the sum of their lengths.
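As a concrete illustration of this balancing act, the following sketch computes a two-part code length for an assumed family of candidate "theories" (Markov models of order k, with a crude cost of 8 bits per parameter); these choices are assumptions for illustration, not the MDL literature's own definitions. MDL then picks the order k minimizing L(theory) + L(data | theory).

```python
# A minimal MDL sketch under assumed ingredients: theories are order-k Markov
# models, L(theory) is a crude 8 bits per parameter, L(data|theory) is the
# negative log2-likelihood of the data under plug-in parameter estimates.
import math
import random

def neg_log2_likelihood(data: str, k: int) -> float:
    """Bits needed to encode the data with an order-k Markov model."""
    counts = {}
    for i in range(k, len(data)):
        ctx, sym = data[i - k:i], data[i]
        counts.setdefault(ctx, [1, 1])           # Laplace smoothing
        counts[ctx][int(sym)] += 1
    bits = 0.0
    for i in range(k, len(data)):
        ctx, sym = data[i - k:i], data[i]
        c0, c1 = counts[ctx]
        p = (c1 if sym == "1" else c0) / (c0 + c1)
        bits += -math.log2(p)
    return bits

def mdl_score(data: str, k: int, bits_per_param: int = 8) -> float:
    model_bits = bits_per_param * (2 ** k)            # part 1: describe the theory
    return model_bits + neg_log2_likelihood(data, k)  # + part 2: data given theory

random.seed(1)
# Data with genuine order-1 structure plus noise: the next bit usually repeats the last.
data = "0"
for _ in range(2000):
    data += data[-1] if random.random() < 0.9 else str(1 - int(data[-1]))

for k in range(5):
    print(k, round(mdl_score(data, k), 1))
```

Order 0 is the "virtually empty theory" (short model part, long data part), a large k overfits by paying for parameters it does not need, and the minimum of the sum typically lands at the small order that actually matches the regularity in the data.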

To the extent that Occam's Razor is intuitively acceptable, the MDL principle seems acceptable as well. In fact, however, under certain conditions it can be proved that MDL does indeed work well. Suppose we want to explain some data D. Let $\mathcal{H} = \{H_1, H_2, \ldots\}$ be the finite or countably infinite set of all possible hypotheses or theories. We assume each possible D and each $H_i$ is somehow encoded as a binary string. These $H_i$ are assumed to be exhaustive as well as mutually exclusive. Furthermore, we assume each $H_i$ induces a probability distribution $\Pr(\cdot \mid H_i)$ on the possible data D, so it makes sense to speak of the conditional probability $\Pr(D \mid H_i)$ that D arises, given that $H_i$ is true. We will assume each $\Pr(\cdot \mid H_i)$ to be recursive. For a recursive probability distribution P, we define K(P) to be the length of a shortest Turing machine that computes P(x), given x, for all x. Let P be an a priori probability distribution on the possible hypotheses: $\sum_{H \in \mathcal{H}} P(H) = 1$. P(H) can be interpreted as the probability we are willing to confer upon H before having seen any data. P can be used to define an a priori probability on the possible D: $\Pr(D) = \sum_{H \in \mathcal{H}} \Pr(D \mid H) \, P(H)$. This is the sum, over all H, of the probability that H is true and causes D. If both $\Pr(\cdot \mid H)$ and P are computable, then $\Pr(D)$ can be computed (or approximated, if $\mathcal{H}$ is infinite).

So we have a prior distribution on the hypotheses and on the data. Now we observe the data D, which induces us to adjust the probabilities we confer upon the different hypotheses. For example, if $\Pr(D \mid H_i) = 0$, then observing D effectively rules out the possibility that $H_i$ is the right hypothesis, and we can put the probability $\Pr(H_i \mid D) = 0$: given D, we know $H_i$ is impossible. Similarly, we can update the other probabilities, incorporating the fact that D was observed. Bayes's Theorem or Bayes's Rule (which is actually due to Laplace rather than Bayes) is basically a rule which tells us how to update the a priori probability P(H) to an a posteriori probability $\Pr(H \mid D)$:

$$\Pr(H \mid D) = \frac{\Pr(D \mid H) \, P(H)}{\Pr(D)}.$$

This rule gives a mathematically sound way to update the probabilities conferred upon the hypotheses, given that D has been observed. The best hypothesis H is the most probable one given D, i.e., the H that maximizes $\Pr(H \mid D)$ (there may actually be several different such H, so we should rather speak of a best hypothesis than of the best hypothesis).

This approach gives a sound and objective way to select an optimal hypothesis, and if we have P, as well as the resources to do all required computation, then this approach is the best way to select a hypothesis. Unfortunately, the problem is that we usually will not know P. Thus the Bayesian approach is an optimal but usually unworkable hypothesis selection scheme.

However, using Kolmogorov complexity, we can show that the MDL principle is often a very good approximation to the above Bayesian approach. First, if we take negative logarithms on both sides of Bayes's Rule, we get the equivalent

$$-\log \Pr(H \mid D) = -\log P(H) - \log \Pr(D \mid H) + \log \Pr(D).$$

Since $\log \Pr(D)$ is fixed (independent of H), choosing a hypothesis H which maximizes $\Pr(H \mid D)$ is equivalent to choosing an H which minimizes $-\log P(H) - \log \Pr(D \mid H)$. Thus hypothesis selection according to Bayes's Rule is equivalent to selecting a hypothesis H that minimizes $-\log P(H) - \log \Pr(D \mid H)$. Of course, the same old problem still obtains: we do not know P(H).
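The equivalence just derived is easy to check numerically. The sketch below uses an assumed toy prior and assumed likelihoods (hypothetical numbers, not taken from the thesis) and verifies that the hypothesis maximizing Pr(H | D) is the one minimizing -log P(H) - log Pr(D | H).

```python
# A minimal check that Bayes's Rule and the two-part code length agree,
# using assumed toy numbers.
import math

prior      = {"H1": 0.70, "H2": 0.25, "H3": 0.05}       # P(H), an assumed prior
likelihood = {"H1": 0.001, "H2": 0.030, "H3": 0.200}     # Pr(D | H), assumed

evidence  = sum(prior[h] * likelihood[h] for h in prior)              # Pr(D)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}   # Bayes's Rule
two_part  = {h: -math.log2(prior[h]) - math.log2(likelihood[h]) for h in prior}

best_bayes = max(posterior, key=posterior.get)
best_code  = min(two_part, key=two_part.get)
assert best_bayes == best_code   # maximizing Pr(H|D) = minimizing -log P(H) - log Pr(D|H)
print(best_bayes, round(posterior[best_bayes], 3), round(two_part[best_code], 2))
```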

Now suppose we replace this unknown prior P by the following universal distribution:

$$\mathbf{m}(x) = 2^{-K(x)}.$$

This distribution assigns each natural number (or binary string) x a non-zero probability. The universal distribution incorporates Occam's Razor by giving simple hypotheses high probability: if K(H) is small, then $\mathbf{m}(H)$ will be relatively high. This m is enumerable but not recursive. It has the property that for every enumerable probability distribution P (in particular, for every recursive P) there is a constant c such that for every x we have $c \cdot \mathbf{m}(x) \geq P(x)$.

What happens if we replace P by m? Let us first impose three conditions: (1) we restrict attention to recursive P; (2) the true hypothesis H is P-random; and (3) the data D is $\Pr(\cdot \mid H)$-random. The latter two conditions informally mean that H is a 'typical' hypothesis for P, and D is 'typical' data for H.

An alternative is to use the Maximum Likelihood principle, which selects an H with highest $\Pr(D \mid H)$. In other words, this principle favours a hypothesis that makes the data most probable. However, this principle may give strongly counterintuitive results. For instance, the principle may well select D itself as its (rather useless) hypothesis, because $\Pr(D \mid D)$ is maximal.

It can be shown that $\sum_x \mathbf{m}(x) < 1$ [LV], so m is not quite a probability distribution. However, we can construct an additional dummy object u and define $\mathbf{m}(u) = 1 - \sum_x \mathbf{m}(x)$ in order to make the probabilities sum to one.

It should be noted that we attach no objective character to the universal prior probability: we do not consider it the 'true' probability of theories (it is not even clear what this might mean). Rather, we take an instrumentalist stance, satisfying ourselves with the conclusion that using m as a prior gives good results.


Since an overwhelming majority of all strings will be 'typical' in this sense, this is not a very strong restriction. Under these conditions it can be shown that the following inequality holds [LV]:

$$K(H) - K(P) \leq -\log P(H) \leq K(H).$$

This means that $-\log P(H)$ and $-\log \mathbf{m}(H) = K(H)$ are equal to within a fixed constant, namely K(P), which is independent of H.

Similarly, if we define $K(x \mid y)$ as the length of a shortest Turing machine that generates x, given string y on a special input tape, then under the above conditions $-\log \Pr(D \mid H)$ and $K(D \mid H)$ are approximately equal:

$$K(D \mid H) - K(\Pr(\cdot \mid H)) \leq -\log \Pr(D \mid H) \leq K(D \mid H).$$

As we saw, our original aim to maximize $\Pr(H \mid D)$ is equivalent to minimizing $-\log P(H) - \log \Pr(D \mid H)$. Since K(H) approximates $-\log P(H)$, and $K(D \mid H)$ approximates $-\log \Pr(D \mid H)$, minimizing $-\log P(H) - \log \Pr(D \mid H)$ now turns out to be approximately equivalent to minimizing $K(H) + K(D \mid H)$. That is, we want to select an H which minimizes the description length of H plus the description length of the data D encoded with the help of H. But this is just what the MDL principle says! Therefore the Bayesian approach, which is optimal but infeasible because we do not know the prior P, can be approximated using Kolmogorov complexity, which gives us the MDL principle.

How good an approximation is MDL to the Bayesian selection scheme? If we define $\alpha(P, H) = K(\Pr(\cdot \mid H)) + K(P)$ and add up the two inequalities above, we obtain the following:

$$K(H) + K(D \mid H) - \alpha(P, H) \leq -\log P(H) - \log \Pr(D \mid H) \leq K(H) + K(D \mid H).$$

We call an H admissible if it satisfies this inequality. As we can see, if this inequality holds and $\alpha(P, H)$ is small, then $K(H) + K(D \mid H)$ and $-\log P(H) - \log \Pr(D \mid H)$ will be close to each other, and the MDL principle will provide a good approximation to Bayesian hypothesis selection. The following is a theorem from [LV]:

Theorem. Let $\alpha(P, H)$ be small. Then Bayes's Rule and MDL are optimized (or almost optimized) by the same hypothesis among the admissible H's. That is, there is one admissible H that simultaneously (almost) minimizes both $-\log P(H) - \log \Pr(D \mid H)$ (selection according to Bayes's Rule) and $K(H) + K(D \mid H)$ (selection according to MDL).

It is not a very surprising requirement that the data should be random ('typical') in order for MDL to work. After all, non-typical data may well be very misleading. For example, suppose we are given some data pertaining to the movements of the planets (a number of points in space and time). Since the planetary orbits are not circular, typical data will be inconsistent with the theory that planets move in perfect circles around the sun. However, if we are given non-typical data which are consistent with this theory, then we cannot be blamed for jumping to the simple but wrong conclusion that planets move in circles. Analogously, MDL's preference for simple theories may go wrong if the given data is non-typical and just happens to be consistent with one or more very simple but wrong theories: MDL would then select an overly simple and incorrect theory, but it cannot be blamed for this; if the right information just isn't there, MDL can't be expected to extract it from the data.


Thus Occam's Razor in the form of the MDL principle can be justified if the true hypothesis and the data are 'typical'. Notice, however, that the non-computability of Kolmogorov complexity makes MDL in this form still beyond algorithmic means. The problem now is no longer that we do not know the prior P, but that we generally cannot compute the description lengths K(H) and $K(D \mid H)$. Therefore, most real-world versions of MDL restrict the domain of application in such a way that description lengths are efficiently computable. Nevertheless, despite the fact that Occam's Razor in its MDL form is not fully implementable, the above results do vindicate the razor by showing that if we can somehow follow the razor, we will probably get good results.

Occam and Universal Prediction

In this section we will not be concerned with choosing a good theory or hypothesis, but with predicting the future from the past. Here we will describe Ray Solomonoff's prediction procedure, for which Kolmogorov complexity was initially introduced [Sol].

First, let us mention explicitly that optimal prediction does not coincide with prediction according to the best hypothesis H. In fact, sometimes prediction can yield better results if we refrain from explicitly selecting one hypothesis among the many possibilities. Consider an unfair coin which has an unknown chance p of coming up heads. Suppose there are only two possible hypotheses: $H_1$ says that $p = p_1$ and $H_2$ says that $p = p_2$, and we have determined that $H_1$ is more probably true than $H_2$. Then we clearly should select $H_1$ as the best (most likely) hypothesis. Yet the best prediction for p, given our knowledge, is the weighted average $\Pr(H_1)\, p_1 + \Pr(H_2)\, p_2$, which lies strictly between $p_1$ and $p_2$. If we were to bet on the outcomes of this coin, then this averaged prediction would be more profitable than the $H_1$ prediction $p = p_1$. Thus the best hypothesis need not give the best prediction.

The formal setting in which the prediction will take place is simple, abstract, and austere. We are given a finite initial segment x of a binary sequence, and our aim is to predict how the sequence will continue. Intuitively, we can think of the given initial segment x as a description of relevant aspects of the past, and our predicting the continuation of the sequence is like predicting the future. Where does this x come from? Informally, it may be seen as the outcome of past observations or experiments. Formally, we assume it is drawn according to some unknown semimeasure μ on the set of infinite binary sequences.

What is a semimeasure? Recall from the previous chapter that a measure on $\{0,1\}^{\infty}$, the set of infinite binary sequences, is a function μ from $\{0,1\}^*$ to $[0,1]$ such that

$$\mu(\lambda) = 1, \qquad \mu(x) = \mu(x0) + \mu(x1).$$

Here μ(x) is interpreted as the probability that a string from $\{0,1\}^{\infty}$ starts with x. A semimeasure needs to satisfy only the following weaker conditions:

$$\mu(\lambda) \leq 1, \qquad \mu(x) \geq \mu(x0) + \mu(x1).$$

Clearly a measure is a semimeasure, but not always vice versa.

(An interesting historical point for the relation between philosophy and Kolmogorov-complexity-based prediction: during his physics studies at the University of Chicago in the late 1940s, Solomonoff followed a course given by Rudolf Carnap, who was at that time very active in research on induction and prior probabilities.)

As mentioned, our given x is a description of parts of the world, and we assume this world to behave in accordance with some unknown semimeasure μ, which assigns a number μ(x) to every finite binary string x. Given μ, we can define the conditional semimeasure $\mu(y \mid x)$ as

$$\mu(y \mid x) = \frac{\mu(xy)}{\mu(x)}.$$

Intuitively, $\mu(y \mid x)$ is the probability that, given an initial segment x, the sequence will continue with y (it is not quite a probability, because the probabilities need not sum to one in a semimeasure and need to be renormalized, but it is easiest still to think of μ(x) and $\mu(y \mid x)$ as probabilities). So, for example, if we know μ(x) and μ(x0), then $\mu(0 \mid x) = \mu(x0)/\mu(x)$ can be interpreted as the chance that the next bit will be a 0, given x as initial segment, and the remaining probability (after renormalization) as the chance that it will be a 1. Note that this setting is able to incorporate a non-deterministic world.
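As a small concrete illustration (with an assumed example distribution, not one from the thesis), take the recursive measure μ generated by a biased coin with Pr[1] = 0.8. The conditional μ(y | x) = μ(xy)/μ(x) then gives the chance that the sequence continues with y after the initial segment x; since this μ is a proper measure, the conditionals for 0 and 1 sum to one, whereas for a strict semimeasure they would have to be renormalized.

```python
# A minimal sketch of a recursive measure mu and its conditional mu(y|x),
# for an assumed biased-coin world with Pr[1] = 0.8.

def mu(x: str, p1: float = 0.8) -> float:
    """Probability that an infinite sequence of independent biased coin flips starts with x."""
    prob = 1.0
    for bit in x:
        prob *= p1 if bit == "1" else (1.0 - p1)
    return prob

def mu_cond(y: str, x: str) -> float:
    """mu(y | x) = mu(xy) / mu(x): the chance the sequence continues with y after x."""
    return mu(x + y) / mu(x)

x = "1101"
print(mu_cond("1", x))    # 0.8  : next bit is 1
print(mu_cond("0", x))    # 0.2  : next bit is 0
print(mu_cond("11", x))   # 0.64 : next two bits are 11
```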

Given x, we want to predict how the sequence continues, in such a way that our predictions do not diverge too far from the true but unknown distribution μ. Let us restrict attention to predicting only the next bit of the sequence; further bits can then be predicted by reiterating the procedure. As our world may be probabilistic (non-deterministic), it is best if our predictions are probabilistic as well: rather than definitely predicting 'the next bit will be a 1', we should make predictions of the form 'with such-and-such probability, the next bit will be a 1'.

How can we make a good prediction in a uniform way? In general we can't: if we allow arbitrary semimeasures μ as 'our world', then basically anything can happen, and there is no prediction method that will work well universally. However, now suppose we restrict attention to recursive μ's, that is, to μ's for which there is an effective procedure to calculate μ(x) for every x. This is not a very strong restriction: for example, all usual distributions one finds in books on statistics (the normal one, the uniform one, the exponential one, etc.) are recursive.

Now, almost miraculously, under this mild restriction we can use a single semimeasure, the universal semimeasure M, to predict how the sequence x will continue, no matter what the actual μ is. Informally, this universal M in a way combines all enumerable semimeasures, weighed according to their Kolmogorov complexity. We know what the Kolmogorov complexity of a binary string is, but what is the Kolmogorov complexity of an enumerable semimeasure?

The fact that probabilities need not sum to one in a semimeasure is a technical convenience. One can always renormalize a semimeasure to a measure. For example, if we are given x as initial segment, and we have μ(x0) and μ(x1), then we can use $P(x0) = \mu(x0)/(\mu(x0) + \mu(x1))$ and $P(x1) = \mu(x1)/(\mu(x0) + \mu(x1))$ as probabilities which sum to one. However, in general the measure obtained by renormalizing an enumerable semimeasure need not be enumerable itself; an example is M, defined below.


One can effectively enumerate all enumerable semimeasures (more precisely, the Turing machines that enumerate them) in a sequence $\mu_1, \mu_2, \ldots$ (see [LV]). Given a particular enumeration like this, we can define $K(\mu_i) = K(i)$: the complexity of an enumerable semimeasure is the complexity of its index in the enumeration (after all, given a Turing machine that constructs the enumeration, all we need in order to be able to construct $\mu_i$ is its index i). Now M is defined as the weighed sum of all these $\mu_i$:

$$M(x) = \sum_{n} 2^{-K(\mu_n)} \, \mu_n(x).$$

Each $\mu_i$ is a possibility for the true μ (the one from which x has been drawn), and hence may be seen as a possible hypothesis. The above definition of M incorporates Occam's Razor in giving preference to simple hypotheses: the simple ones (the ones with low $K(\mu_i)$) are considered more preferable, and are assigned high weight $2^{-K(\mu_i)}$ accordingly. The universal M is an enumerable semimeasure, but it is not recursive. 'Universal' here means that M multiplicatively dominates all enumerable semimeasures: for every such semimeasure $\mu_i$ there is a constant $c_i$ such that $c_i M(x) \geq \mu_i(x)$ for every x. Namely, since $M(x) = \sum_n 2^{-K(\mu_n)} \mu_n(x) \geq 2^{-K(\mu_i)} \mu_i(x)$, we can put $c_i = 2^{K(\mu_i)}$, and then $c_i M(x) \geq \mu_i(x)$ for every x. Actually, the particular M that we defined here is just one example of a universal distribution; there are others with the same property.
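The following sketch mixes a handful of candidate Bernoulli measures, each weighed by $2^{-k_i}$ for a hand-assigned 'complexity' $k_i$, and predicts the next bit of a sequence drawn from one of them by M(x1)/M(x). This finite toy mixture is only in the spirit of the universal M, which sums over all enumerable semimeasures and is not computable; every concrete choice here (the candidates, their complexities, the true bias) is an assumption made for illustration.

```python
# A minimal finite-mixture sketch in the spirit of M, with assumed candidates
# and assumed complexities; not the universal semimeasure itself.
import random

def bernoulli(p1: float):
    def measure(x: str) -> float:
        prob = 1.0
        for bit in x:
            prob *= p1 if bit == "1" else (1.0 - p1)
        return prob
    return measure

# Candidate measures mu_i with hand-assigned complexities k_i.
candidates = [(1, bernoulli(0.5)), (2, bernoulli(0.9)),
              (3, bernoulli(0.1)), (4, bernoulli(0.75))]

def M(x: str) -> float:
    """Weighted sum of the candidate measures: M(x) = sum_i 2^(-k_i) * mu_i(x)."""
    return sum(2.0 ** (-k) * mu(x) for k, mu in candidates)

random.seed(2)
true_p1 = 0.9      # the unknown "world": one of the candidates, Bernoulli(0.9)
x = ""
for step in range(1, 51):
    prediction = M(x + "1") / M(x)       # mixture prediction that the next bit is 1
    x += "1" if random.random() < true_p1 else "0"
    if step % 10 == 0:
        print(step, round(prediction, 3))
```

After a modest number of observed bits the prediction settles near the true bias 0.9, because the candidate that explains the data best comes to dominate the weighted sum; this is the finite analogue of M's predictions converging to μ's.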

To repeat our prediction problem once more: we get an initial segment x, and we want to predict the next bit, which is either 0 or 1. Let us look at predicting the probability of getting a 0 as the next bit. The true prediction would of course be $\mu(0 \mid x)$. But alas, μ is not known (and there would not be much to learn if it were). What happens if we use $M(0 \mid x) = M(x0)/M(x)$ as our prediction? Surprisingly, it can be shown that this gives a very good prediction indeed, still assuming μ to be recursive. How good a prediction is $M(0 \mid x)$, compared to $\mu(0 \mid x)$? Suppose our initial segment x has length n − 1, and we want to predict the n-th bit. The following $S_n$ measures the sum of the errors for all possible x of length n − 1 (we square the errors in order to prevent positive and negative errors from cancelling each other out).

Definition. $S_n$ is the expected square of the difference in μ-probability and M-probability of a 0 occurring at the n-th prediction:

$$S_n = \sum_{l(x) = n-1} \mu(x) \left( M(0 \mid x) - \mu(0 \mid x) \right)^2.$$

The following result [LV] tells us that M is a good predictive tool for any recursive μ.

This M is analogous to the universal distribution m of the last section. However, note the distinction between m and M: m assigns a probability to finite strings, while M assigns a probability to finite prefixes of infinite strings.


Theorem. If μ is a recursive semimeasure, then $\sum_{n} S_n \leq \frac{\ln 2}{2} K(\mu)$.

This means not only that the error $S_n$ converges to 0, but also that it does so rather fast: $S_n$ must converge to 0 faster than 1/n does, in order for $\sum_{n} S_n \leq \frac{\ln 2}{2} K(\mu)$ to hold (this is so because $\sum_{n} 1/n$ diverges, whereas $\frac{\ln 2}{2} K(\mu)$ is finite).

It is instructive to see exactly in which way Occam's Razor is justified by the above results. These results do not prove a version of Occam's Razor; rather, we presuppose Occam's Razor by weighing the different $\mu_i$ according to their complexity in the construction of M, and show that it has beneficial consequences to do so, namely a provably successful induction procedure. Let us stress again, however, that such an abstract justification of Occam's Razor does not imply its efficient applicability. In particular, the universal measure M, though enumerable, is not recursive, and getting a good approximation of its value can be very expensive computationally.

On the Very Possibility of Science

Science aims at describing the multitude of observations and data in simple and elegant theories. In other words, science aims at compression. As we have seen in the last three sections, that compression is a good strategy can even be justified mathematically. However, in order for science to be possible in the first place, compression should be possible: if there is nothing we can compress, there is nothing to learn, and everything will just be an incomprehensible flurry.

Let us consider the possibility that there is nothing we can compress, so nothing that science can successfully work on. Suppose the long binary string x is a complete description (in some sense) of our universe. We can roughly distinguish two cases: either x is significantly compressible, i.e., K(x) is much smaller than the length of x, or it is not. In the first case there is clearly a structure inherent in x that scientific research can latch on to. But what about the second case: if x is completely random, how then is science possible?

Fortunately, in this case it can be shown that the irregular x will probably contain some very regular (non-random) substrings ([LV] contains some results to this effect). This result is not as surprising as it may seem at first sight. After all, if we flip a fair coin a huge number of times, then the resulting sequence of heads and tails will probably be completely random, but still a long run of heads is bound to come up at some point, simply because of the laws of probability. So in this second case, even though the world as a whole (i.e., x) is random and a successful 'theory of everything' would be beyond us, some parts of the world (substrings of x) will be non-random and will enable fruitful scientific research to take place. In either case, whether x is compressible or not, we can rest assured (it is reasonable to expect) that science is at least possible in some domains or parts of the world.

Accordingly, scientific research is concerned first with identifying which substrings (i.e., which parts of the world) are sufficiently regular to enable fruitful research, and second with identifying what exactly the structure in those regular substrings is. Clearly, whether a particular science can be successful depends on the regularity of its subject matter: it seems fair to say that the substrings of our world that successful sciences like physics are working on are much more regular than the substrings that are the object of, for instance, sociology.

Summary

Occam's Razor tells us that we should prefer simpler theories over more complex ones, a prescription which is generally followed in science as well as philosophy. What justifies this razor? We described three different formal settings in which a form of Occam's Razor could be mathematically justified. Firstly, in the PAC framework, an Occam algorithm is an algorithm that outputs names of concepts with small Kolmogorov complexity compared to the given examples and the target concept. An efficient Occam algorithm automatically learns probably approximately correct concepts. Secondly, the Minimum Description Length principle tells us to select a hypothesis such that the Kolmogorov complexity of hypothesis plus examples is minimized. It can be shown that in 'typical' cases this approach does indeed select an approximately optimal hypothesis. Thirdly, Solomonoff's prediction procedure predicts the continuation of binary sequences, using a combination of all possible computable probability measures, weighed according to their complexity. Such predictions quickly converge to the true values.

What these three approaches have in common is that they show that simplicity is to be favoured, and hence that compression is a good thing, in science. In the last section we saw how the theory of Kolmogorov complexity renders it very likely that compression, and hence successful scientific activity, is at least possible in some domains.


A Proof of Occam's Razor (PAC Version)

In this appendix we give the proof of the theorem from the section on Occam and PAC learning, which stated a version of Occam's Razor in the framework of PAC learning. Except for using a Kolmogorov complexity bound, rather than a simple length bound, on the names the algorithm outputs, this proof is analogous to the proof in [AB].

Theorem. Let F be a concept class. If there is a polynomial-time Occam algorithm for F, then F is polynomial-time PAC learnable.

Proof. Let L be a polynomial-time Occam algorithm for F. Consider a target concept $f \in F$ and a distribution P on the domain. L reads a set S of m examples for f, and outputs a name r of a concept $g \in F$ consistent with S, satisfying $K(r) \leq (mn)^{\beta} \, l_{min}(f)^{\alpha}$. We will see how we can choose m, in a polynomially bounded way, such that the requirements of a PAC algorithm are satisfied. In the following we use l to abbreviate $l_{min}(f)$.

First note that the number of binary strings of length at most $(mn)^{\beta} l^{\alpha}$ is $2^{(mn)^{\beta} l^{\alpha} + 1} - 1$. L can only output a name r for a concept g if $K(r) \leq (mn)^{\beta} l^{\alpha}$. Hence the number C of concepts for which L can output a name satisfies $C \leq 2^{(mn)^{\beta} l^{\alpha} + 1}$.

Let us call a concept $g \in F$ bad if it has too large an error: $P(f \Delta g) > \varepsilon$. The probability that some particular bad concept g is consistent with one example for f is $1 - P(f \Delta g) < 1 - \varepsilon$; hence the probability that g is consistent with each of our m examples is at most $(1 - \varepsilon)^m$. Now the probability that at least one of the C possible output concepts of L is bad (yet consistent with all m examples) is at most

$$C (1 - \varepsilon)^m \leq 2^{(mn)^{\beta} l^{\alpha} + 1} (1 - \varepsilon)^m.$$

If we can bound this probability by δ, then with probability at least 1 − δ, L will output a good (i.e., non-bad) concept, thus satisfying the requirements of a PAC algorithm. Accordingly, we want to choose m such that

$$2^{(mn)^{\beta} l^{\alpha} + 1} (1 - \varepsilon)^m \leq \delta.$$

Equivalently, taking natural logarithms on both sides:

$$\left( (mn)^{\beta} l^{\alpha} + 1 \right) \ln 2 + m \ln(1 - \varepsilon) \leq \ln \delta.$$

If we abbreviate $A = n^{\beta} l^{\alpha} \ln 2$ and $B = \ln 2 + \ln(1/\delta)$, we can rewrite this to

$$A m^{\beta} + B \leq m \ln \frac{1}{1 - \varepsilon}.$$

Because $\ln \frac{1}{1 - \varepsilon} \geq \varepsilon$, the above inequality is implied by the following:

$$A m^{\beta} + B \leq \varepsilon m.$$

Dividing by m yields

$$\frac{A}{m^{1 - \beta}} + \frac{B}{m} \leq \varepsilon.$$

Since $B/m \leq B/m^{1 - \beta}$, the above inequality is implied by

$$\frac{A + B}{m^{1 - \beta}} \leq \varepsilon.$$

This inequality holds if we choose

$$m \geq \left( \frac{A + B}{\varepsilon} \right)^{1/(1 - \beta)}.$$

Thus, choosing m in this way, L will output a name of a concept g such that, with probability at least 1 − δ, we have $P(f \Delta g) \leq \varepsilon$.

It remains to check the efficiency of L. Since L is an Occam algorithm, it runs in time polynomial in m and n. Furthermore, it is easy to see that our choice of m can be bounded from above by a polynomial in n, l, 1/ε and 1/δ. It follows that L runs in time polynomial in n, l, 1/ε and 1/δ, in accordance with the definition of polynomial-time PAC learnability. Hence F is polynomial-time PAC learnable. □
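The sample size derived in this proof can be computed explicitly. The sketch below (with purely illustrative, assumed parameter values in the example call) evaluates $m \geq ((A + B)/\varepsilon)^{1/(1-\beta)}$ with $A = n^{\beta} l^{\alpha} \ln 2$ and $B = \ln 2 + \ln(1/\delta)$.

```python
# A minimal sketch computing the sample size bound from the proof above.
import math

def occam_sample_size(eps: float, delta: float, n: int, l: int,
                      alpha: float, beta: float) -> int:
    """m >= ((A + B) / eps)^(1/(1-beta)), with A = n^beta * l^alpha * ln 2
    and B = ln 2 + ln(1/delta)."""
    A = (n ** beta) * (l ** alpha) * math.log(2)
    B = math.log(2) + math.log(1.0 / delta)
    return math.ceil(((A + B) / eps) ** (1.0 / (1.0 - beta)))

# Example with assumed, purely illustrative values of the parameters:
print(occam_sample_size(eps=0.1, delta=0.05, n=20, l=50, alpha=1.0, beta=0.5))
```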


Chapter

Summary and Conclusion

Though each of the previous chapters had its own short summary, it might benefit the reader's overview of the thesis if we wrap things up once more. Accordingly, in the following pages we will take stock, summarizing in non-technical language what we have done.

Computational Learning Theory

The first chapter introduced computational learning theory, the branch of Artificial Intelligence which studies the complexity and other theoretical properties of mechanisms for learning. Given some examples (positive and negative instances of some unknown target concept, a subset of a domain of objects), our aim is to learn this concept. However, since the examples will usually not give complete information about the target, we cannot expect to learn this target perfectly. Instead, we can only hope to learn an approximately correct concept: a concept which diverges only slightly from the target, in the sense that the target and the learned concept will agree on most members of the domain. Moreover, since the given set of examples may be biased and need not always be a good representative of the target concept as a whole, we cannot even expect to learn approximately correctly every time. Accordingly, the best we can do is learn a concept which is probably approximately correct (PAC).

The model of PAC learning formalizes this idea in precise mathematical terms. In this model, a class of concepts is considered to be learnable if there exists an algorithm which efficiently learns a PAC concept whenever the target is drawn from that class. This can both be seen as a rough model of learning by children (or human beings generally) and as a model of theory construction in empirical science.

The PAC model can be extended in various ways. Firstly, we may allow the learning algorithm access to an oracle. The oracle can answer certain questions posed by the learner, for instance membership queries, which ask whether a certain object is a member of the target, or equivalence queries, which ask whether a certain concept is equal to the target.

Polynomial-time PAC learnable, or polynomial-time PAC predictable.


In the latter case, the PAC requirement is usually strengthened to the requirement that the target is identified exactly by the learning algorithm. Learnability in this stronger model implies learnability in the PAC model. Secondly, we can make the PAC model more realistic by allowing the examples to contain some noise (errors), and examining learnability in the presence of such noise.

Language Learning

Since the late 1950s, the work of Noam Chomsky has dominated linguistics, and has revived the old debate about innate knowledge. Specifically, Chomsky argues that the speed and accurateness with which children learn their native language, despite the 'poverty' of the input (the sentences, i.e., examples, they receive from parents and others), can only be explained by postulating that the child is already born with some linguistic bias, some pre-knowledge of its native language. Without such a bias, natural languages would not be learnable. Chomsky particularly focuses on syntax, identifying a hierarchy of four classes of languages, from Type 3 up to Type 0, with increasing syntactical complexity. Each of these classes properly includes the simpler ones. The class of Type 0 languages contains all languages that can be enumerated by an algorithm, and hence is the broadest class that lies within the reach of algorithmic means.

In the second chapter we looked at learnability results pertaining to such classes of languages, and the results were rather negative. The class of Type 3 languages is not learnable in the PAC model without membership queries, and neither are the more complex classes. We defined an even simpler class of languages, which also turned out to require too many examples to be efficiently learnable. The Type 3 class is learnable given the ability to make membership queries, and the Type 2 class may be learnable in this case (this is not known to the author). However, these classes are still too restricted to contain natural languages such as English and Dutch. Finally, the Type 1 and Type 0 classes are not even learnable with membership queries.

We can see from ordinary children that the class of natural languages is learnable in some non-technical sense. Assuming the PAC model to be sufficiently realistic, we conjectured that this class is also learnable in the technical PAC sense. From this it follows that the class of languages that children can learn cannot be some broad class such as the Type 0 or Type 1 languages. Therefore, in order to be able to explain how language learning can take place, we must conclude that the class of natural languages is very restricted, and that a child must somehow have pre-knowledge of the particular restrictions. Thus, indeed, we must have a linguistic bias in favour of languages of some specific kind, as Chomsky argued. Whether this bias is innate cannot conclusively be proved using learnability arguments, but innateness does seem to be the most likely option.

Kolmogorov Complexity

The third chapter discussed Kolmogorov complexity and some of its philosophically interesting consequences and applications. Technicalities apart, the Kolmogorov complexity K(x) of a finite binary string x is the length of a shortest Turing machine which produces that string when fed into some universal Turing machine U. The choice of U can affect the value of K(x) only up to a constant that does not depend on x, which makes K(x) a sufficiently objective measure of the complexity of x. The function K is not computable, but it can be algorithmically approximated. For longer strings it converges to Shannon's measure of information content.

Kolmogorov complexity can be used to give an objective definition of the degree of simplicity of descriptions, data, theories, etc. (represented as binary strings): a description is simple to the extent that it has low Kolmogorov complexity. Because of the great importance of the notion of simplicity, especially in Occam's Razor (see the next section), this is a significant formalization. Particularly as a property of scientific theories, simplicity is strongly correlated with the subjective notions of elegance and beauty.

Secondly, Kolmogorov complexity can be used in a formalization of the property of randomness of finite or infinite binary strings, with respect to some probability distribution P. Roughly, a string is P-random if it possesses all properties one can attribute to random strings, i.e., if it passes all effective tests for randomness. The set of all tests for P-randomness can be combined in a single universal test. In case P is the uniform measure (which distributes probability uniformly), randomness varies with incompressibility: a finite string is random to the extent that it is incompressible (x is incompressible if K(x) is near to the length of x), and an infinite string is random if each of its prefixes is incompressible. One important example of a random infinite string is the binary expansion of the number Ω, which is the probability that a randomly drawn binary string encodes a halting Turing machine. The first n bits of this number contain sufficient information to find out whether any Turing machine of length at most n halts, and hence to find out the answers to the questions that can be encoded in such machines.

Finally, we can use Kolmogorov complexity to prove Gödel's important incompleteness theorem without invoking the self-referring constructs employed in the usual proof. Basically, the proof shows that any finite or recursively enumerable set of axioms will have only a finite complexity, and hence will not contain sufficient information to prove all truths about the infinite, incompressible string Ω. Hence it is not possible to capture all mathematical truths in a formal axiomatic theory.

Occam's Razor

The fourth chapter dealt with mathematical justifications of Occam's Razor. In its most often cited form (which cannot be found in Occam's writings), this says that 'entities are not to be multiplied without necessity'. More broadly, we can render Occam's Razor as saying that we should always prefer the simplest hypothesis or theory among those that are consistent with the data. This principle can be interpreted in at least three ways: as merely a principle of method; ontologically ('selecting simple theories is good because the world itself is relatively simple'); or aesthetically ('beauty, of which simplicity is an important aspect, indicates truth'). The preference for simplicity profoundly influences science and philosophy, but is usually accepted without further justification.

plicity If we solve this by identifying simplicity with low Kolmogorov complex

ity we can give formal justications of the razor in three dierent contexts

Firstly in the PAC mo del an Occam algorithm is a learning algorithm which

outputs a concept that is consistent with the examples as well as relatively sim

ple compared to those examples An ecient Occam algorithm can b e shown

to b e an ecient PAC algorithm Hence if a learning algorithm can suciently

compress the examples it will automatically learn PAC

Secondly the Minimum Description Length principle tells us to select the

simplest hyp othesis ie a hyp othesis which most compresses the data Under

certain mild assumptions this hyp othesis selection scheme can b e shown to b e

approximately equivalent to the optimal but infeasible hyp othesis selection that

is based on Bayess Theorem Accordingly under those assumptions Occams

Razor is an approximately optimal rule for hyp othesis selection

Thirdly and nally Kolmogorov complexity was originally intro duced in

order to give a universal metho d for prediction We are given an initial nite

binary sequence drawn according to some unknown computable probability

distribution and we are to predict how the sequence will continue We can pre

dict this by combining the predictions of al l computable distributions weighing

those predictions according to the Kolmogorov complexity of the distributions

Here distributions with low complexity are given high weight in accordance

with the Occamite preference for simplicity This Occambased prediction can

b e shown to converge very quickly to the true values

Conclusion

The main aim of this thesis has been to examine the philosophical relevance of recent results from the field of computational learning theory. The models of learning put forward in that field are abstract and mathematical, but still capture much of what is important in real-world learning. Accordingly, they can be used as models of learning by human beings, and as models of inductive theory construction in the empirical sciences. For the former case, the main application we discussed was a computational analysis of human language learning; for the latter we discussed various formal justifications of Occam's Razor, using Kolmogorov complexity as a measure of simplicity. We feel that these results, as well as others that we discussed and others we did not discuss, are highly relevant for philosophy, and merit more attention than they presently receive. Let me end by expressing the hope that this thesis will contribute something to an increased awareness of formal learning theory among philosophers.

List of Symbols

∈  element
⊆  subset
⊂  proper subset
⊇  superset
⊃  proper superset
∪  union
∩  intersection
\  set difference
Δ  symmetric difference
{x | C(x)}  set of all x that satisfy condition C
∅  empty set
|S|  cardinality (number of elements) of set S
|r|  absolute value of number r
2^S  power set (set of all subsets) of set S
S × T  Cartesian product of sets S and T
S²  Cartesian product of set S with itself
f : S → T  function f with set S as domain and set T as range
N  set of natural numbers
Q  set of rational numbers
R  set of real numbers
∞  infinity
Σ_S  summation over all members of set S
log  logarithm with base 2
ln  natural logarithm (base e)
λ  empty string
S*  set of all finite strings over alphabet S
{0,1}*  set of all finite binary strings
S^∞  set of all infinite strings over alphabet S
{0,1}^∞  set of all infinite binary strings
ω_n  first n bits of infinite binary string ω
X^n  set of all strings of length at most n in domain X
f^n  projection of concept f on X^n
F^n  projection of concept class F on X^n
P(A)  probability of event A
P(A|B)  probability of A given B
δ  confidence parameter
ε  error parameter
η  rate of malicious or random classification noise
l_min(f, R)  size (shortest name) of concept f in representation R
l_min(S, R)  size of smallest concept consistent with examples S in R
D_VC  Vapnik-Chervonenkis dimension
l(x)  length of binary string x
l(T)  length of shortest program for Turing machine T
K(x)  Kolmogorov complexity of binary string x
K(x|y)  Kolmogorov complexity of string x given string y
Ω  halting probability
m  universal distribution
M  universal semimeasure

Bibliography

Aar E Aarts Investigations in Logic Language and Computation PhD

thesis University of Utrecht

[AB] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, Cambridge, UK.

AL D Angluin and P Laird Learning from noisy examples Machine

Learning

All D Allp ort The changing relationship b etween AI program

ming languages and natural language pro cessing formalisms In

R Sp encerSmith and S Torrance editors Machinations Compu

tational Studies of Logic Language and Cognition pages

Ablex Publishing Corp oration Norwo o d NJ

Anga D Angluin Learning k b ounded contextfree grammars Technical

Rep ort YALEUDCSRR Department of Computer Science

Yale University

Angb D Angluin Learning regular sets from queries and counterexam

ples Information and Computation

Ang D Angluin Queries and concept learning Machine Learning

Ari Aristotle Posterior Analytics Harvard University Press Cam

bridge MA Edited and translated by Hugh Tredennick

Ari R Ariew Ockhams Razor A Historical and Philosophical Analysis

of Ockhams Principle of Parsimony PhD thesis University of

Illinois UrbanaChampaign

Bac J Bacon Four mo dal mo dellings Journal of Philosophical Logic

Bac F Bacon Novum Organum Op en Court Chicago IL Edited

and translated by P Urbach and J Gibson First published in

BC H Brandt Corstius Algebrasche Taalkunde Oostho ek Utrecht

In Dutch


[BEHW] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters.

[BEHW] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM.

BGT R Boyd P Gasp er and J Trout editors The Philosophy of Sci

ence MIT Press Cambridge MA

BJ G S Bo olos and R C Jerey Computability and Logic Cambridge

University Press Cambridge UK third edition

Boa G Boas Some assumptions of Aristotle Transactions of the Amer

ican Philosophical Society N S

Bot R P Botha Chal lenging Chomsky Basil Blackwell

Bun M Bunge The complexity of simplicity Journal of Philosophy

Bur E A Burtt The Metaphysical Foundations of Modern Science

Routledge London revised edition First edition

Cam K Campb ell Abstract Particulars Basil Blackwell Oxford

Car R Carnap Logical Foundations of Probability Routledge Kegan

Paul London

Car R Carnap The Continuum of Inductive Methods The University

of Chicago Press Chicago IL

Cha G J Chaitin On the length of programs for computing nite

binary sequences Journal of the ACM

Cha G J Chaitin On the length of programs for computing nite

binary sequences Statistical considerations Journal of the ACM

Cha G J Chaitin Informationtheoretic limitations of formal systems

Journal of the ACM

Cha G J Chaitin Randomness and mathematical pro of Scientic

American May

Cha G J Chaitin Algorithmic Information Theory Cambridge Uni

versity Press Cambridge UK

Che C Cherniak Minimal Rationality MIT Press Cambridge MA

Cho N Chomsky Syntactic Structures Mouton The Hague


Cho N Chomsky Aspects of the Theory of Syntax MIT Press Cam

bridge MA

Cho N Chomsky Quines empirical assumptions In Davidson and

Hintikka DH pages

Cho N Chomsky Lectures on Government and Binding Fortis Dor

drecht

Cho N Chomsky On cognitive structures and their development A

reply to Piaget In PiattelliPalmarini PP

Cho N Chomsky Know ledge of Language Its Nature Origin and Use

Praeger

Cho N Chomsky Linguistics and cognitive science Problems and mys

teries In Kasher Kas pages

Coh I B Cohen Revolution in Science BelknapHarvard University

Press Cambridge MA and London

CT T Cover and J Thomas Elements of Information Theory Wiley

New York

Den D C Dennett Real patterns Journal of Philosophy

Der W Derkse On Simplicity and Elegance An Essay in Intel lectual

History PhD thesis University of Amsterdam

DH D Davidson and J Hintikka editors Words and Objections Es

says on the Work of W V Quine Reidel revised edition

Dir P A M Dirac The evolution of the physicists picture of the world

Scientic American

[GJ] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York.

GKPS G Gazdar E Klein G Pullum and I Sag Generalized Phrase

Structure Grammar Basil Blackwell Oxford

God K Godel Ub er formal unentscheidbare Satze der Principia Math

ematica und verwandter Systeme I Monatshefte fur Mathematik

und Physik

Gol E M Gold Language identication in the limit Information and

Control

Go oa N Go o dman Problems and Projects BobbsMerrill Indianap olis

and New York


Go ob N Go o dman Safety strength simplicity In Problems and Projects

Go oa pages First published in

Go oc N Go o dman The test of simplicity In Problems and Projects

Go oa pages First published in

Go od N Go o dman Uniformity and simplicity In Problems and Projects

Go oa pages First published in

Go o N Go o dman Fact Fiction and Forecast Harvard University

Press Cambridge MA fourth edition

Har R A Harris The Linguistics Wars Oxford University Press New

York

Haw J A Hawkins editor Explaining Language Universals Basil Black

well Oxford

Hema C G Hemp el Studies in the logic of conrmation part I Mind

Hemb C G Hemp el Studies in the logic of conrmation part I I Mind

Hem C G Hemp el Philosophy of Natural Science PrenticeHall En

glewo o d Clis NJ

Hes M B Hesse Simplicity In P Edwards editor The Encyclopedia

of Philosophy pages Volume Macmillan London

Hof D R Hofstadter Godel Escher Bach An Eternal Golden Braid

Basic Bo oks New York

Hom Homer The Iliad Harvard University Press Cambridge MA

Translated by A T Murray Two volumes

HU J E Hop croft and J D Ullman Introduction to Automata Theory

Languages and Computation Addison Wesley Reading MA

Hum D Hume An Enquiry Concerning Human Understanding Gateway

edition Chicago IL First published in

Hum D Hume A Treatise of Human Nature Dolphin Bo oks Doubleday

First published in

Ish H Ishizaka Polynomial time learnability of simple deterministic

languages Machine Learning

Jev W S Jevons The Principles of Science A Treatise Macmillan

London

Kas A Kasher editor The Chomskyan Turn Basil Blackwell


Kol A N Kolmogorov Three approaches to the quantitative denition

of information Problems of Information Transmission

Kol A N Kolmogorov Logical basis for information theory and prob

ability theory IEEE Transactions on Information Theory IT

Kuh T Kuhn Second thoughts on paradigms In F Supp e editor

The Structure of Scientic Theories pages University of

Illinois Press second edition

[KV] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA.

Lai P D Laird Learning from Good and Bad Data Kluwer Academic

Publishers Boston MA

Lak I Lakatos History of science and its rational reconstructions In

R C Buck and R S Cohen editors Philosophy of Science As

sociation volume of Boston Studies in the Philosophy of

Science pages Reidel Dordrecht

Lam M van Lambalgen Algorithmic information theory Journal of

Symbolic Logic

Lo c J Lo cke An Essay Concerning Human Understanding Every

man Library David Campb ell Publishers London abridged edi

tion First published in

Lok G J C Lokhorst Is de mens een eindige automaat In F Geraedts

and L de Jong editors Ergo Cogito III Portret van een generatie

pages Historische Uitgeverij Groningen

[LV] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, Berlin, second edition.

Lyo W Lyons editor Modern Philosophy of Mind Everyman London

McA J W McAllister Beauty Revolution in Science Cornell Univer

sity Press Ithaca NY and London

Mil J S Mill A System of Logic Ratiocinative and Inductive Harp er

New York

Min M L Minsky editor Semantic Information Processing MIT Press

Cambridge MA

ML P MartinLof The denition of random sequences Information

and Control


[Nat] B. K. Natarajan. Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA.

New F J Newmeyer Rules and principles in the historical development

of generative syntax In Kasher Kas pages

NW SH NienhuysCheng and R de Wolf Foundations of Inductive

Logic Programming volume of Lecture Notes in Articial In

tel ligence SpringerVerlag Berlin May

Pai A Pais Subtle is the Lord The Science and the Life of Albert

Einstein Oxford University Press Oxford

Pas J A Passmore A Hundred Years of Philosophy Penguin second

edition

Pei C S Peirce Col lected Papers Harvard University Press Cam

bridge MA Edited by C Harstshorne and P Weiss Volumes

IVI I

Pen R Penrose The Emperors New Mind Concerning Computers

Minds and the Laws of Physics Oxford University Press Oxford

Pin S Pinker Language Learnability and Language Development Har

vard University Press Cambridge MA The edition

contains new commentary by Pinker

Pin S Pinker The Language Instinct Penguin Bo oks

Pla U T Place Is consciousness a brain pro cess In Lyons Lyo

pages First published in

Pop K R Popp er The Logic of Scientic Discovery Hutchinson Lon

don

Pos P M Postal Limitations of phrase structure grammars In J A

Fo dor and J J Katz editors The Structure of Language Readings

in the Philosophy of Language pages Prentice Hall

PP M PiattelliPalmarini editor Language and Learning The De

bate Between Jean Piaget and Noam Chomsky Routledge London

PR S Peters and R Ritchie On the generative p ower of transforma

tional grammars Information Sciences

PS A Prince and P Smolensky Optimality From neural networks to

universal grammar Science March

Put H Putnam The innateness hyp othesis and explanatory mo dels

in linguistics In J Searle editor The Philosophy of Language

Oxford University Press New York


Put H Putnam What is innate and why Comments on the debate In

PiattelliPalmarini PP pages

Qui W V O Quine Word and Object MIT Press Cambridge MA

Qui W V O Quine Natural kinds In Ontological Relativity and Other

Essays pages Columbia University Press New York

Reprinted in BGT pages

Qui W V O Quine Reply to Chomsky In Davidson and Hintikka

DH pages

Qui W V O Quine On simple theories of a complex world In The

Ways of Paradox and Other Essays pages Harvard Uni

versity Press Cambridge MA revised and enlarged edition

Rei H Reichenbach The Theory of Probability University of California

Press Berkeley

[Ris] J. Rissanen. Modeling by shortest data description. Automatica.

[Ris] J. J. Rissanen. Stochastic Complexity in Statistical Inquiry. Series in Computer Science, World Scientific, Singapore.

Ros H Rosemont Gathering evidence for linguistic innateness Syn

these

Rus B Russell Human Know ledge Its Scope and Limits George Allen

and Unwin London

Rus B Russell The Problems of Philosophy Oxford University Press

Oxford First published in

Rus S Russell Inductive learning by machines Philosophical Studies

Sei M S Seidenb erg Language acquisition and use Learning and ap

plying probabilistic constraints Science March

Sha C E Shannon A mathematical theory of communication Bel l

Systems Technology Journal

Slo R H Sloan Four typ es of noise in PAC learning Information

Processing Letters

Sma J J C Smart Sensations and brain pro cesses In Lyons Lyo

pages First published in


Sob E Sob er Simplicity Clarendon Press Oxford

[Sol] R. J. Solomonoff. A formal theory of inductive inference, parts I and II. Information and Control.

Tha P Thagard Philosophy and machine learning Canadian Journal

of Philosophy

Tho W M Thornburn The myth of Ockhams razor Mind

Tur A M Turing On computable numb ers with an application to the

Entscheidungproblem In Proceedings of the London Mathemati

cal Society volume pages Correction ibidem

vol pages

Val L G Valiant A theory of the learnable Communications of the

ACM

Val L G Valiant Learning disjunctions of conjunctions In Proceedings

of the th International Joint Conference on Articial Intel ligence

IJCAI pages Morgan Kaufmann

VC V N Vapnik and A Y Chervonenkis On the uniform convergence

of relative frequencies of events to their probabilities Theory of

Probability and its Applications

Wei J Weinb erg A Short History of Medieval Philosophy Princeton

University Press Princeton NJ

Wil D C Williams Principles of Empirical Realism Charles

C Thomas Springeld

Wit L Wittgenstein Philosophische Untersuchungen Suhrkamp

Edition published together with Tractatus and Tagebuc her

Index of Names

Aarts E Galileo G

Allp ort D Garey M n

Angluin D Gazdar G

Anthony M

Godel K

Ariew R

Gold E M

Aristotle

Go o dman N

Bacon F

Harris R n

Bacon J

Haussler D

Bayes T

Hawkins J n

Bayle P n

Hemp el C

Biggs N

Hesse M n

Blumer A

Hofstadter D

Boas G n

Homer

Bo olos G n

Hop croft J n

Botha R n

Hume D

Brandt Corstius H

Ishizaka H

Bunge M n

Burtt E

Jerey R n

Jevons S

Campb ell K

Johnson D n

Carnap R n

Chaitin G

Kearns M n

Cherniak C v

n

Chervonenkis A

Keats J n

Chomsky N vi n n

Kepler J

Klein E

Church A

Kolmogorov A

Cohen I B

Kuhn T n

Cop ernicus N

Cover T

Laird P

Lakatos I

Dennett D

Lambalgen M van

Derkse W

Laplace P

Descartes R

Leibniz G W

Dirac P

Li M n n

Duns Scotus

Lo cke J

Ehrenfeucht A

Einstein A Lokhorst GJ n


MartinLof P Thomas J

McAllister J v Thornburn W

Mill J S Turing A

Minsky M

Ullman J n

Mises R von

Mondriaan P

Valiant L v

Vapnik V

Natara jan B n

Vazirani U n

Newmeyer F n

n

Newton I

Vitanyi P n n

Occam W of

Pais A

Wald A

Passmore J

Warmuth M

Peirce C S

Weinb erg J

Penrose R

Williams D C

Peters S

Wittgenstein L n n

Pinker S n n

Place U T

Popp er K

Postal P

Prince A n

Ptolemy

Pullum G

Putnam H

Quine W V O

Reichenbach H

Rissanen J

Ritchie R

Rosemont H

Russell B

Russell S v

Sag I

Satie E

Schlick M

Seidenb erg M n

Shannon C

Skinner B F

Sloan R

Smart J

Smolensky P n

Sob er E n

Solomono R

Thagard P v

Index of Subjects

contextfree grammar see Typ e a p osteriori probability

grammar

a priori probability

contextsensitive grammar see Typ e

adversarial noise

grammar

aesthetic canons

agnosticism

decidable set

AI see Articial Intelligence

deep structure

algorithm

deterministic nite automaton

approximately correct

art

DFA see deterministic nite automa

Articial Intelligence v

ton

atheism

diagonal lemma

autological

Diophantine equation

axiomatization

domain

dualism

Bayess Theorem

b eauty

elegance

b ehaviorism

empiricism

binary expansion

enco ded Turing machine

entropy

Chomsky hierarchy

enumerable function

Chomsky normal form n

equivalence query

ChurchTuring thesis

error parameter n

coenumerable function example

extended equivalence query

COLT see computational learning

theory

Fermats last theorem n

comp etence

tting

complete axiomatization

formal language

complete pro of pro cedure n

complexity theory v

Godel numb er

compression

Godels Theorem

computable function

Goldbachs conjecture

computational learning theory v

grammar

Greibach normal form

concept

Grellings paradox

concept class

concept learning

halting probability

condence parameter n

consistent with examples is enumerable


is random Machine Learning v

halting problem malicious noise

heterological materialism

Maximum Likeliho o d n

identication from equivalence queries

MDL see Minimum Description Length

measure

induction

memb ership query

inductive logic programming vi

message

information

mindbrain

innateness

Minimum Description Length

intuitive biology

Mo ores law

intuitive mechanics

random

k b ounded grammar

Natural Grammars n

k recursive language

natural language

knowhow knowledge

Natural Languages

Kolmogorov complexity vi

negative example

n

neural network n n

noise

and information theory

nominalism

is coenumerable

nonterminal

is not recursive

nonterminal memb ership queries

ob jectivity of

NP completeness n

lab el of an example

language

Occam algorithm

language faculty

Occams Razor vi

language instinct

language learning vi

aesthetical interpretation

nite classes

and prediction

Typ e languages

and universal distribution

Typ e languages

and universal semimeasure

Typ e languages

as MDL principle

Typ e languages

formulations of

Typ e languages

in PAC learning

language of arithmetic

in philosophy examples

language organ

in science examples

learning v

justications for

length of a name

metho dological interpretation

length of an example

length parameter

ontological interpretation

lexicon

liars paradox

ontological simplicity n

linguistic bias

ontology

oracle

linguistics n

logical p ositivism P random


pro jection of a concept or concept Ptest

class

PAC algorithm

prop ositional logic

admissible

randomized

query

PAC learning v

and eciency

random classication noise

of formal languages

random numb er generator n

with noise

randomness

PAC predicting

and compression

PAC prediction algorithm

deciency n

partial recursive function

in mathematics

of nite strings p erformance

of innite strings

philosophy of mind

rationalism

philosophy of science

real function

phrase structure rule

recursive function

p olynomial

recursive language

p olynomial VCdimension

recursive set

p olynomialsample PAC learnable

recursively enumerable language

recursively enumerable set

p olynomialtime tting

regular grammar see Typ e gram

p olynomialtime identication from

mar

equivalence queries

representation

implies p olynomialtime PAC learn

evaluable

ability

p olynomially evaluable

p olynomialtime Occam algorithm

RichardBerry paradox

p olynomialtime PAC learnable

sample complexity

n

science

p olynomialtime PAC predictable

p ossibility of

n

scientic revolution

p ositive example

selfconsciousness

prediction

selfreference

not equal to hyp othesis selec

semantics

tion

semimeasure

prexfree

semiotic simplicity n

principle of parsimony or simplicity

sentence

see Occams Razor

sequential test

principlesandparameters theory

shatters

simple deterministic grammar

probability distribution

simplicity vi

probably approximately correct learn

size of a concept

ing see PAC learning

sound axiomatization

pro duction

Standard Mo del of arithmetic

program for a universal machine surface structure

symmetric dierence


syntax

target concept

terminal

test for randomness

time complexity

transformationalgenerative grammar

traveling salesman problem

trop e theory

Turing machine n

Typ e grammar

Typ e grammar

not p olynomially evaluable

Typ e grammar

p olynomially evaluable n

Typ e grammar n

Typ e grammar n

underdetermination of theory by data

uniform measure

universal Ptest

universal distribution m

Universal Grammar n

universal semimeasure M

universal sequential test

universal Turing machine

VapnikChervonenkis dimension see

VCdimension

VCdimension

Wiener Kreis