
Grammatical Inference and First Language Acquisition

Alexander Clark ([email protected])
ISSCO / TIM, University of Geneva
UNI-MAIL, Boulevard du Pont-d'Arve, CH-1211 Genève 4, Switzerland

Abstract

One argument for parametric models of language has been learnability in the context of first language acquisition. The claim is made that "logical" arguments from learnability theory require non-trivial constraints on the class of languages. Initial formalisations of the problem (Gold, 1967) are, however, inapplicable to this particular situation. In this paper we construct an appropriate formalisation of the problem using a modern vocabulary drawn from statistical learning theory and grammatical inference, and looking in detail at the relevant empirical facts. We claim that a variant of the Probably Approximately Correct (PAC) learning framework (Valiant, 1984) with positive samples only, modified so it is not completely distribution-free, is the appropriate choice. Some negative results derived from cryptographic problems (Kearns et al., 1994) appear to apply in this situation, but the existence of algorithms with provably good performance (Ron et al., 1995) and subsequent work shows that these negative results are not as strong as they initially appear, and that recent algorithms for learning regular languages partially satisfy our criteria. We then discuss the applicability of these results to parametric and non-parametric models.

1 Introduction

For some years, the relevance of formal results in grammatical inference to the empirical question of first language acquisition by infant children has been recognised (Wexler and Culicover, 1980). Unfortunately, for many researchers, with a few notable exceptions (Abe, 1988), this begins and ends with Gold's famous negative results in the identification in the limit paradigm. This paradigm, though still widely used in the grammatical inference community, is clearly of limited relevance to the issue at hand, since it requires the model to be able to exactly identify the target language even when an adversary can pick arbitrarily misleading sequences of examples to provide. Moreover, the paradigm as stated has no bounds on the amount of data or computation required for the learner. In spite of the inapplicability of this particular paradigm, in a suitable analysis there are quite strong arguments that bear directly on this problem.

Grammatical inference is the study of machine learning of formal languages. It has a vast formal literature and has been applied to a wide selection of different problems, where the "languages" under study can be (representations of) parts of natural languages, sequences of nucleotides, moves of a robot, or some other sequence data. For any conclusions that we draw from formal discussions to have any applicability to the real world, we must be sure to select, or construct, from the rich set of formal devices available an appropriate formalisation. Even then, we should be very cautious about making inferences about how the child must or cannot learn language: subsequent developments in GI might allow a more nuanced description in which these conclusions are not valid. The situation is complicated by the fact that the field of grammatical inference, much like the wider field of machine learning in general, is in a state of rapid change.

In this paper we hope to address this problem by justifying the selection of the appropriate learning framework, starting by looking at the actual situation the child is in, rather than from an a priori decision about the right framework. We will not attempt a survey of grammatical inference techniques; nor shall we provide proofs of the theorems we use here. Arguments based on formal learnability have been used to support the idea of parameter-based theories of language (Chomsky, 1986). As we shall see below, under our analysis of the problem these arguments are weak. Indeed, they are more pertinent to questions about the autonomy and modularity of language learning: the question whether learning of some level of linguistic knowledge – morphology or syntax, for example – can take place in isolation from other forms of learning, such as the acquisition of word meaning, and without interaction, grounding and so on.

Positive results can help us to understand how humans might learn languages by outlining the class of algorithms that might be used by humans, considered as computational systems at a suitable abstract level. Conversely, negative results might be helpful if they could demonstrate that no algorithms of a certain class could perform the task – in this case we could know that the human child learns his language in some other way.

We shall proceed as follows: after briefly describing FLA, we describe the various elements of a model of learning, or framework. We then make a series of decisions based on the empirical facts about FLA, to construct an appropriate model or models, avoiding unnecessary idealisation wherever possible. We proceed to some strong negative results, well known in the GI community, that bear on the questions at hand. The most powerful of these (Kearns et al., 1994) appears to apply quite directly to our chosen model. We then discuss an interesting algorithm (Ron et al., 1995) which shows that this can be circumvented, at least for a subclass of regular languages. Finally, after discussing the possibilities for extending this result to all regular languages, and beyond, we conclude with a discussion of the implications of the results presented for the distinction between parametric and non-parametric models.

2 First Language Acquisition

Let us first examine the phenomenon we are concerned with: first language acquisition. In the space of a few years, children almost invariably acquire, in the absence of explicit instruction, one or more of the languages that they are exposed to. A multitude of subsidiary debates have sprung up around this central issue, covering questions about critical periods – the ages at which this can take place – the exact nature of the evidence available to the child, and the various phases of linguistic use through which the infant child passes. In the opinion of many researchers, explaining this ability is one of the most important challenges facing linguists and cognitive scientists today.

A difficulty for us in this paper is that many of the idealisations made in the study of this field are in fact demonstrably false. Classical assumptions, such as the existence of uniform communities of language users, are well motivated in the study of the "steady state" of a system, but less so when studying acquisition and change. There is a regrettable tendency to slip from viewing these idealisations correctly – as counter-factual idealisations – to viewing them as empirical facts that need to be explained. Thus, when looking for an appropriate formulation of the problem, we should recall for example the fact that different children do not converge to exactly the same knowledge of language, as is sometimes claimed, nor do all of them acquire a language competently at all, since there is a small proportion of children who, though apparently neurologically normal, fail to acquire language. In the context of our discussion later on, these observations lead us to accept slightly less stringent criteria, where we allow a small probability of failure and do not demand perfect equality of hypothesis and target.

3 Grammatical Inference

The general field of machine learning has a specialised subfield that deals with the learning of formal languages. This field, Grammatical Inference (GI), is characterised above all by an interest in formal results, both in terms of formal characterisations of the target languages, and in terms of formal proofs either that particular algorithms can learn according to particular definitions, or that sets of languages cannot be learnt. In spite of its theoretical bent, GI algorithms have also been applied with some success. Natural language, however, is not the only source of real-world applications for GI. Other domains include biological sequence data, artificial languages, such as discovering XML schemas, or sequences of moves of a robot. The field is also driven by technical motives and the intrinsic elegance and interest of the mathematical ideas employed. In summary, it is not just about language, and accordingly it has developed a rich vocabulary to deal with the wide range of its subject matter.

In particular, researchers are often concerned with formal results – that is, we want algorithms where we can prove that they will perform in a certain way. Often, we may be able to empirically establish that a particular algorithm performs well, in the sense of reliably producing an accurate model, while we may be unable to prove formally that the algorithm will always perform in this way. This can be for a number of reasons: the mathematics required in the derivation of the bounds on the errors may be difficult or obscure, or the algorithm may behave strangely when dealing with sets of data which are ill-behaved in some way.

The basic framework can be considered as a game played between two players. One player, the teacher, provides information to another, the learner, and from that information the learner must identify the underlying language. We can break down this situation further into a number of elements.

We assume that the languages to be learned are drawn in some way from a possibly infinite class of languages, L, which is a set of formal mathematical objects. The teacher selects one of these languages, which we call the target, and then gives the learner a certain amount of information of various types about the target. After a while, the learner then returns its guess, the hypothesis, which in general will be a language drawn from the same class L. Ideally the learner has been able to deduce or induce or abduce something about the target from the information we have given it, and in this case the hypothesis it returns will be identical to, or close in some technical sense to, the target. If the learner can consistently do this, under whatever constraints we choose, then we say it can learn that class of languages. To turn this vague description into something more concrete requires us to specify a number of things.

• What sort of mathematical objects should we use to represent a language?

• What is the target class of languages?

• What information is the learner given?

• What computational constraints does the learner operate under?

• How close must the target be to the hypothesis, and how do we measure it?

This paper addresses the extent to which negative results in GI could be relevant to this real-world situation. As always, when negative results from theory are being applied, a certain amount of caution is appropriate in examining the underlying assumptions of the theory and the extent to which these are applicable. As we shall see, in our opinion, none of the current negative results, though powerful, are applicable to the empirical situation. We shall accordingly, at various points, make strong pessimistic assumptions about the learning environment of the child, and show that even under these unrealistically stringent stipulations, the negative results are still inapplicable. This will make the conclusions we come to a little sharper. Conversely, if we wanted to show that the negative results did apply, to be convincing we would have to make rather optimistic assumptions about the learning environment.

4 Applying GI to FLA

We now have the delicate task of selecting, or rather constructing, a formal model by identifying the various components we have identified above. We want to choose the model that is the best representation of the learning task or tasks that the infant child must perform. We consider that some of the empirical questions do not yet have clear answers. In those cases, we shall make the choice that makes the learning task more difficult. In other cases, we may not have a clear idea of how to formalise some information source. We shall start by making a significant idealisation: we consider language acquisition as being a single task. Natural languages as traditionally described have different levels. At the very least we have morphology and syntax; one might also consider inter-sentential or discourse structure as an additional level. We conflate all of these into a single task: learning a formal language; in the discussion below, for the sake of concreteness and clarity, we shall talk in terms of learning syntax.

4.1 The Language

The first question we must answer concerns the language itself. A formal language is normally defined as follows. Given a finite alphabet Σ, we define the set of all strings (the free monoid) over Σ as Σ*. We want to learn a language L ⊂ Σ*. The alphabet Σ could be a set of phonemes, or characters, or a set of words, or a set of lexical categories (part-of-speech tags). The language could be the set of well-formed sentences, or the set of words that obey the phonotactics of the language, and so on. We reduce all of the different learning tasks in language to a single abstract task – identifying a possibly infinite set of strings. This is overly simplistic, since transductions, i.e. mappings from one string to another, are probably also necessary. We are using here a standard definition of a language where every string is unambiguously either in or not in the language. This may appear unrealistic – if the formal language is meant to represent the set of grammatical sentences, there are well-known methodological problems with deciding where exactly to draw the line between grammatical and ungrammatical sentences. An alternative might be to consider acceptability rather than grammaticality as the defining criterion for inclusion in the set. Moreover, there is a certain amount of noise in the input. There are other possibilities. We could for example use a fuzzy set – i.e. a function from Σ* → [0, 1] where each string has a degree of membership between 0 and 1. This would seem to create more problems than it solves. A more appealing option is to learn distributions, again functions f from Σ* → [0, 1] but where ∑_{s∈L} f(s) = 1. This is of course the classic problem of language modelling, and is compelling for two reasons. First, it is empirically well grounded – the probability of a string is related to its frequency of occurrence – and secondly, we can deduce from the language-processing capability of humans that they must have some similar capability.
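
To make the two options concrete, the following small sketch (our illustration only; the toy alphabet, language and probabilities are invented and are not from the original formulation) contrasts a crisp language, represented as a membership function on Σ*, with a distribution over strings whose probabilities sum to one:

    # Toy illustration of the two representations discussed above.
    # Alphabet, language and probabilities are invented for illustration only.

    SIGMA = {"a", "b"}

    # Crisp language: a membership function over strings in Sigma*.
    # Here, strings over {a, b} of length at most 4 with equal numbers of a's and b's.
    def in_language(s: str) -> bool:
        return len(s) <= 4 and set(s) <= SIGMA and s.count("a") == s.count("b")

    # Distribution: a function f from Sigma* to [0, 1] whose values sum to 1.
    # Here a toy distribution whose support is exactly that language.
    f = {
        "": 0.4,
        "ab": 0.2, "ba": 0.2,
        "aabb": 0.05, "abab": 0.05, "abba": 0.04,
        "baba": 0.03, "baab": 0.02, "bbaa": 0.01,
    }

    assert abs(sum(f.values()) - 1.0) < 1e-9
    assert all(in_language(s) for s in f)

A learner of the first kind must decide membership; a learner of the second kind must approximate f itself, which is the language-modelling view favoured here.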

Both possibilities – crisp languages and distributions – are reasonable. The choice depends on what one considers the key phenomena to be explained: grammaticality judgments by native speakers, or natural use and comprehension of the language. We favour the latter, and accordingly think that learning distributions is a more accurate, and more difficult, choice.

4.2 The class of languages

A common confusion in some discussions of this topic is between languages and classes of languages. Learnability is a property of classes of languages. If there is only one language in the class of languages to be learned, then the learner can just guess that language and succeed. A class with two languages is again trivially learnable if you have an efficient algorithm for testing membership. It is only when the set of languages is exponentially large or infinite that the problem becomes non-trivial, from a theoretical point of view. The class of languages we need is a class that includes all attested human languages and additionally all "possible" human languages. Natural languages are thought to fall into the class of mildly context-sensitive languages (Vijay-Shanker and Weir, 1994), so clearly this class is large enough. It is, however, not necessary that our class be this large. Indeed, it is essential for learnability that it is not. As we shall see below, even the class of regular languages contains some subclasses that are computationally hard to learn. Indeed, we claim it is reasonable to define our class so it does not contain languages that are clearly not possible human languages.

4.3 Information sources

Next we must specify the information that our learning algorithm has access to. Clearly the primary source of data is the primary linguistic data (PLD), namely the utterances that occur in the child's environment. These will consist of both child-directed speech and adult-to-adult speech. These are generally acceptable sentences, that is to say sentences that are in the language to be learned. These are called positive samples. One of the most long-running debates in this field is over whether the child has access to negative data – unacceptable sentences that are marked in some way as such. The consensus (Marcus, 1993) appears to be that they do not. In middle-class Western families, children are provided with some sort of feedback about the well-formedness of their utterances, but this is unreliable and erratic, not a universal of global child-raising. Furthermore, this appears to have no effect on the child. Children do also get indirect pragmatic feedback if their utterances are incomprehensible. In our opinion, both of these would be better modelled by what is called a membership query: the algorithm may generate a string and be informed whether that string is in the language or not. However, we feel that this is too erratic to be considered an essential part of the process. Another question is whether the input data is presented as a flat string or annotated with some sort of structural evidence, which might be derived from prosodic or semantic information. Unfortunately, there is little agreement on what the constituent structure should be – indeed many linguistic theories do not have a level of constituent structure at all, but just dependency structure.

Semantic information is also claimed as an important source. The hypothesis is that children can use lexical semantics, coupled with rich sources of real-world knowledge, to infer the meaning of utterances from the situational context. That would be an extremely powerful piece of information, but it is clearly absurd to claim that the meaning of an utterance is uniquely specified by the situational context. If that were true, there would be no need for communication or information transfer at all. Of course the context puts some constraints on the sentences that will be uttered, but it is not clear how to incorporate this fact without being far too generous. In summary, it appears that only positive evidence can be unequivocally relied upon, though this may seem a harsh and unrealistic environment.

4.4 Presentation

We have now decided that the only evidence available to the learner will be unadorned positive samples drawn from the target language. There are various possibilities for how the samples are selected. The choice that is most favourable for the learner is where they are selected by a helpful teacher to make the learning process as easy as possible (Goldman and Mathias, 1996). While it is certainly true that carers speak to small children in sentences of simple structure (Motherese), this is not true for all of the data that the child has access to, nor is it universally valid. Moreover, there are serious technical problems with formalising this, namely what is called 'collusion', where the teacher provides examples that encode the grammar itself, thus trivialising the learning process. Though attempts have been made to limit this problem, they are not yet completely satisfactory. The next alternative is that the examples are selected randomly from some fixed distribution. This appears to us to be the appropriate choice, subject to some limitations on the distributions that we discuss below. The final option, the most difficult for the learner, is where the sequence of samples can be selected by an intelligent adversary, in an attempt to make the learner fail, subject only to the weak requirement that each string in the language appears at least once. This is the approach taken in the identification in the limit paradigm (Gold, 1967), and is clearly too stringent. The remaining question then regards the distribution from which the samples are drawn: whether the learner has to be able to learn for every possible distribution, or only for distributions from a particular class, or only for one particular distribution.

4.5 Resources

Beyond the requirement of computability we will wish to place additional limitations on the computational resources that the learner can use. Since children learn the language in a limited period of time, which limits both the amount of data they have access to and the amount of computation they can use, it seems appropriate to disallow algorithms that use unbounded or very large amounts of data or time. As normal, we shall formalise this by putting polynomial bounds on the sample complexity and computational complexity. Since the individual samples are of varying length, we need to allow the computational complexity to depend on the total length of the sample. A key question is what the parameters of the sample complexity polynomial should be. We shall discuss this further below.

4.6 Convergence Criteria

Next we address the issue of reliability: the extent to which all children acquire language. First, variability in achievement of particular linguistic milestones is high. There are numerous causes, including deafness, mental retardation, cerebral palsy, specific language impairment and autism. Generally, autistic children appear neurologically and physically normal, but about half may never speak. Autism, on some accounts, has an incidence of about 0.2%. Therefore we can require learning to happen with arbitrarily high probability, but requiring it to happen with probability one is unreasonable. A related question concerns convergence: the extent to which children exposed to a linguistic environment end up with the same language as others. Clearly they are very close, since otherwise communication could not happen, but there is ample evidence from studies of variation (Labov, 1975) that there are non-trivial differences between speakers who have grown up with near-identical linguistic experiences, about the interpretation and syntactic acceptability of simple sentences, quite apart from the wide purely lexical variation that is easily detected. A famous example in English is "Each of the boys didn't come". Moreover, language change requires some children to end up with slightly different grammars from the older generation. At the very most, we should require that the hypothesis should be close to the target. The function we use to measure the 'distance' between hypothesis and target depends on whether we are learning crisp languages or distributions. If we are learning distributions, then the obvious choice is the Kullback-Leibler divergence – a very strict measure. For crisp languages, the probability of the symmetric difference with respect to some distribution is natural.

4.7 PAC-learning

These considerations lead us to some variant of the Probably Approximately Correct (PAC) model of learning (Valiant, 1984). We require the algorithm to produce, with arbitrarily high probability, a good hypothesis. We formalise this by saying that for any δ > 0 it must produce a good hypothesis with probability more than 1 − δ. Next we require a good hypothesis to be arbitrarily close to the target, so we have a precision ε and we say that for any ε > 0, the hypothesis must be less than ε away from the target. We allow the amount of data it can use to increase as the confidence and precision get smaller. We define PAC-learning in the following way: given a finite alphabet Σ, and a class of languages L over Σ, an algorithm PAC-learns the class L if there is a polynomial q such that for every confidence δ > 0 and precision ε > 0, for every distribution D over Σ*, for every language L in L, whenever the number of samples exceeds q(1/ε, 1/δ, |Σ|, |L|), the algorithm must produce a hypothesis H such that, with probability greater than 1 − δ, Pr_D(H∆L) < ε. Here we use A∆B to mean the symmetric difference between two sets. The polynomial q is called the sample complexity polynomial. We also limit the amount of computation to some polynomial in the total length of the data it has seen.
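
As a purely illustrative check of this criterion (our own sketch; the target, hypothesis and distribution below are invented toy objects, not part of the formal development), the error of a hypothesis is just the probability mass that the distribution D places on the symmetric difference H∆L:

    # Toy illustration of the PAC success criterion Pr_D(H delta L) < epsilon.
    # Target, hypothesis and distribution are invented for illustration only.

    target = {"a", "ab", "abb"}                              # L
    hypothesis = {"a", "ab", "abbb"}                         # H
    D = {"a": 0.5, "ab": 0.3, "abb": 0.15, "abbb": 0.05}     # distribution over Sigma*

    def error(H, L, D):
        """Probability, under D, of the symmetric difference of H and L."""
        return sum(p for s, p in D.items() if (s in H) != (s in L))

    epsilon = 0.25
    print(error(hypothesis, target, D))            # 0.15 + 0.05 = 0.20
    print(error(hypothesis, target, D) < epsilon)  # True: this hypothesis is close enough

A PAC learner must achieve this for every admissible choice of D and L, with probability at least 1 − δ over the random draw of its q(1/ε, 1/δ, |Σ|, |L|) samples.
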
Note first of all that the definition above is a worst-case bound – we are not requiring merely that on average the learner comes close. Additionally, this model is what is called 'distribution-free'. This means that the algorithm must work for every combination of distribution and language. This is a very stringent requirement, only mitigated by the fact that the error is calculated with respect to the same distribution that the samples are drawn from. Thus, if there is a subset of Σ* with low aggregate probability under D, the algorithm will not get many samples from this region, but will not be penalised very much for errors in that region.

From our point of view, there are two problems with this framework: first, we only want to draw positive samples, but the distributions are over all strings in Σ*, and include some that give a zero probability to all strings in the language concerned. Secondly, this is too pessimistic, because the distribution has no relation to the language: intuitively it is reasonable to expect the distribution to be derived in some way from the language, or from the structure of a grammar generating the language. Indeed, there is a causal connection in reality, since the sample of the language the child is exposed to is generated by people who do in fact know the language.

One alternative that has been suggested is the PAC learning with simple distributions model introduced by (Denis, 2001). This is based on ideas from complexity theory, where the samples are drawn according to a universal distribution defined by the conditional Kolmogorov complexity. While mathematically correct, this is inappropriate as a model of FLA for a number of reasons. First, learnability is proven only on a single very unusual distribution, and relies on particular properties of this distribution; secondly, there are some very large constants in the sample complexity polynomial.

The solution we favour is to define some natural class of distributions based on a grammar or automaton generating the language. Given a class of languages defined by some generative device, there is normally a natural stochastic variant of the device which defines a distribution over that language. Thus regular languages can be defined by finite-state automata, and these can be naturally extended to probabilistic finite-state automata. Similarly, context-free languages are normally defined by context-free grammars, which can be extended to probabilistic or stochastic CFGs. We therefore propose a slight modification of the PAC framework. For every class of languages L defined by some formal device, define a class of distributions D defined by a stochastic variant of that device. Then for each language L, we select the set of distributions whose support is equal to the language, subject to a polynomial bound q on the complexity of the distribution in terms of the complexity of the target language: D_L^+ = {D ∈ D : L = supp(D) ∧ |D| < q(|L|)}. Samples are drawn from one of these distributions.

There are two technical problems here: first, this does not penalise over-generalisation. Since the distribution is over positive examples, negative examples have zero weight, so we need some penalty function over negative examples, or alternatively we must require the hypothesis to be a subset of the target. Secondly, this definition is too vague. The exact way in which you extend the "crisp" language to a stochastic one can have serious consequences. When dealing with regular languages, for example, though the class of languages defined by deterministic automata is the same as that defined by non-deterministic automata, the same is not true for their stochastic variants. Additionally, one can have exponential blow-ups in the number of states when determinising automata. Similarly, with CFGs, (Abney et al., 1999) showed that two parametrisations of stochastic context-free languages are equivalent, but that there are blow-ups in both directions when converting between them. We do not have a completely satisfactory solution to this problem at the moment; an alternative is to consider learning the distributions rather than the languages.

In the case of learning distributions, we have the same framework, but the samples are drawn according to the distribution being learned, T, and we require that the hypothesis H has small divergence from the target: D(T||H) < ε. Since the divergence is infinite if the hypothesis gives probability zero to a string in the target, this has the consequence that the hypothesis must assign a non-zero probability to every string.
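
The divergence criterion can be illustrated with a short sketch (ours; the toy distributions are invented): D(T||H) is a sum over the support of the target, and a single string that H misses makes it infinite, which is the point made above.

    import math

    # D(T||H) = sum over s of T(s) * log(T(s) / H(s)); toy distributions only.
    T = {"a": 0.6, "ab": 0.3, "abb": 0.1}                    # target distribution
    H = {"a": 0.5, "ab": 0.3, "abb": 0.1, "b": 0.1}          # hypothesis distribution

    def kl_divergence(T, H):
        total = 0.0
        for s, p in T.items():
            q = H.get(s, 0.0)
            if q == 0.0:
                return math.inf    # one missed string makes the divergence infinite
            total += p * math.log(p / q)
        return total

    print(kl_divergence(T, H))              # about 0.109: H covers the support of T
    print(kl_divergence(T, {"a": 1.0}))     # inf: this hypothesis gives "ab" probability zero
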
Sim- theoretic bounds on sample complexity, derived ilarly context free languages are normally de®ned from the Vapnik-Chervonenkis (VC) dimension of by context-free grammmars which can be extended the space of languages, a measure of the complex- again to to Probabilistic or stochastic CFG. We ity of the set of hypotheses. If we add a parameter therefore propose a slight modi®cation of the PAC- to the sample complexity polynomial that represents framework. For every class of languages L, de®ned the complexity of the concept to be learned then this by some formal device de®ne a class of distribu- will remove these problems. This can be the size of tions de®ned by a stochastic variant of that device. a representation of the target which will be a poly- D. Then for each language L, we select the set of nomial in the number of states, or simply the num- distributions whose support is equal to the language ber of non-terminals or states. This is very standard and subject to a polynomial bound (q)on the com- in most ®elds of machine learning. plexity of the distribution in terms of the complex- + The second problem relates not to the amount ity of the target language: DL = {D ∈ D : L = of information but to the computation involved. supp(D) ∧ |D| < q(|L|)}. Samples are drawn from Results derived from cryptographic limitations on one of these distributions. computational complexity, can be proved based on There are two technical problems here: ®rst, this widely held and well supported assumptions that doesn't penalise over-generalisation. Since the dis- certain hard cryptographic problems are insoluble. tribution is over positive examples, negative exam- In what follows we assume that there are no ef®- ples have zero weight, so we need some penalty cient algorithms for common cryptographic prob-

There may be algorithms that will learn with reasonable amounts of data but that require unfeasibly large amounts of computation to find a good hypothesis. There are a number of powerful negative results on learning in the purely distribution-free situation we considered and rejected above. (Kearns and Valiant, 1989) showed that acyclic deterministic automata are not learnable even with positive and negative examples. Similarly, (Abe and Warmuth, 1992) showed a slightly weaker, representation-dependent result on learning non-deterministic automata with a large alphabet, by showing that there are strings such that maximising the likelihood of the string is NP-hard. Again, this does not strictly apply to the partially distribution-free situation we have chosen.

However, there is one very strong result that appears to apply. A straightforward consequence of (Kearns et al., 1994) shows that acyclic deterministic probabilistic FSAs over a two-letter alphabet cannot be learned under another cryptographic assumption (the noisy parity assumption). Therefore any class of languages that includes this comparatively weak family will not be learnable in our framework.

But this rests upon the assumption that the class of possible human languages must include some cryptographically hard functions. It appears that our formal apparatus does not distinguish between these cryptographic functions, which have been consciously designed to be hard to learn, and natural languages, which presumably have evolved to be easy to learn, since there is no evolutionary pressure to make them hard to decrypt – no intelligent predators eavesdropping, for example. Clearly this is a flaw in our analysis: we need to find some more nuanced description for the class of possible human languages that excludes these hard languages or distributions.

6 Positive results

There is a positive result that shows a way forward. A PDFA (probabilistic deterministic finite-state automaton) is µ-distinguishable if the distributions generated from any two states differ by at least µ in the L∞ norm, i.e. there is a string with a difference in probability of at least µ. (Ron et al., 1995) showed that µ-distinguishable acyclic PDFAs can be PAC-learned, using the KL divergence as the error function, in time polynomial in n, 1/ε, 1/δ, 1/µ and |Σ|. They use a variant of a standard state-merging algorithm. Since these automata are acyclic, the languages they define are always finite. This additional criterion of distinguishability suffices to guarantee learnability. This work can be extended to cyclic automata (Clark and Thollard, 2004a; Clark and Thollard, 2004b), and thus to the class of all regular languages, with the addition of a further parameter which bounds the expected length of a string generated from any state. The use of distinguishability seems innocuous; in syntactic terms it is a consequence of the plausible condition that for any pair of distinct non-terminals there is some fairly likely string generated by one and not the other. Similarly, strings of symbols in natural language tend to have limited length. An alternate way of formalising this is to define a class of distinguishable automata, where the distinguishability of the automata is lower-bounded by an inverse polynomial in the number of states. This is formally equivalent, but avoids adding terms to the sample complexity polynomial. In summary, this would be a valid solution if all human languages actually lay within the class of regular languages. Note also the general properties of this kind of algorithm: provably learning an infinite class of languages with infinite support, using only polynomial amounts of data and computation.
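
The distinguishability condition itself is easy to state computationally; the sketch below (ours, with two invented toy suffix distributions) computes the L∞ distance between the distributions generated from two states, which must be at least µ for every pair of distinct states in a µ-distinguishable PDFA:

    # L-infinity distance between the suffix distributions of two PDFA states.
    # The two toy distributions are invented for illustration only.

    def l_infinity(p, q):
        support = set(p) | set(q)
        return max(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

    state1 = {"a": 0.7, "ab": 0.2, "abb": 0.1}
    state2 = {"a": 0.4, "ab": 0.4, "b": 0.2}

    mu = 0.25
    print(l_infinity(state1, state2))          # 0.3
    print(l_infinity(state1, state2) >= mu)    # True: the two states are distinguishable

State-merging algorithms of the kind used by (Ron et al., 1995) rely on tests of exactly this sort, applied to empirical estimates of the state distributions, to decide whether two candidate states should be merged or kept apart.
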
It is worth pointing out that the algorithm does not need to "know" the values of the parameters. Define a new parameter t, and set, for example, n = t, L = t, δ = e^(-t), ε = t^(-1) and µ = t^(-1). This gives a sample complexity polynomial in the single parameter t, say q(t). Given a certain amount of data N, we can just choose the largest value of t such that q(t) < N, and set the parameters accordingly.

7 Parametric models

We can now examine the relevance of these results to the distinction between parametric and non-parametric models. Parametric models are those where the class of languages is parametrised by a small set of finite-valued (binary) parameters, where the number of parameters is small compared to the log2 of the complexity of the languages. Without this latter constraint the notion is mathematically vacuous, since, for example, any context-free grammar in Chomsky normal form can be parametrised with N^3 + NM + 1 binary parameters, where N is the number of non-terminals and M the number of terminals. This constraint is also necessary for parametric models to make testable empirical predictions about language universals, about developmental evidence, and about relationships between the two (Hyams, 1986). We neglect here the important issue of lexical learning: we assume, implausibly, that lexical learning can take place completely before syntax learning commences.
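
Returning to the counting remark above, a minimal worked instance (ours; the grammar sizes are arbitrary) shows how the N^3 + NM + 1 figure can be read: presumably one binary choice for each possible rule A → B C, one for each possible rule A → a, plus one further parameter.

    # Counting binary parameters for a CNF grammar, as in the formula above:
    # N**3 possible rules A -> B C, N*M possible rules A -> a, plus one more.
    # The grammar sizes below are arbitrary illustrative values.

    def cnf_parameters(N: int, M: int) -> int:
        return N ** 3 + N * M + 1

    for N, M in [(2, 10), (10, 1000), (50, 10000)]:
        print(N, M, cnf_parameters(N, M))
    # 2 10 29
    # 10 1000 11001
    # 50 10000 625001

Even a modest grammar therefore corresponds to a very large number of binary parameters, which is why the requirement of a small number of parameters is needed to make the parametric notion contentful.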

It has in the past been stated that the finiteness of a language class suffices to guarantee learnability even under a PAC-learning criterion (Bertolo, 2001). This is, in general, false, and arises from neglecting constraints on the sample complexity and on the computational complexities both of learning and of parsing. The negative result of (Kearns et al., 1994) discussed above applies also to parametric models. The specific class of noisy parity functions that they prove are unlearnable are parametrised by a number of binary parameters in a way very reminiscent of a parametric model of language. The mere fact that there is a finite number of parameters does not suffice to guarantee learnability, if the resulting class of languages is exponentially large, or if there is no polynomial algorithm for parsing. This does not imply that all parametrised classes of languages will be unlearnable, only that having a small number of parameters is neither necessary nor sufficient to guarantee efficient learnability. If the parameters are shallow, relate to easily detectable properties of the languages, and are independent, then learning can occur efficiently (Yang, 2002). If they are "deep" and inter-related, learning may be impossible. Learnability depends more on simple statistical properties of the distributions of the samples than on the structure of the class of languages.

Our conclusion then is ultimately that the theory of learnability will not be able to resolve disputes about the nature of first language acquisition: these problems will have to be answered by empirical research, rather than by mathematical analysis.

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, funded in part by the Swiss Federal Office for Education and Science (OFES). This publication only reflects the authors' views.

References

N. Abe and M. K. Warmuth. 1992. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205-260.

N. Abe. 1988. Feasible learnability of formal grammars and the theory of natural language acquisition. In Proceedings of COLING 1988, pages 1-6.

S. Abney, D. McAllester, and F. Pereira. 1999. Relating probabilistic grammars and automata. In Proceedings of ACL '99.

Stefano Bertolo. 2001. A brief overview of learnability. In Stefano Bertolo, editor, Language Acquisition and Learnability. Cambridge University Press.

Noam Chomsky. 1986. Knowledge of Language: Its Nature, Origin, and Use. Praeger.

Alexander Clark and Franck Thollard. 2004a. PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5:473-497, May.

Alexander Clark and Franck Thollard. 2004b. Partially distribution-free learning of regular languages from positive samples. In Proceedings of COLING, Geneva, Switzerland.

F. Denis. 2001. Learning regular languages from simple positive examples. Machine Learning, 44(1/2):37-66.

E. M. Gold. 1967. Language identification in the limit. Information and Control, 10(5):447-474.

S. A. Goldman and H. D. Mathias. 1996. Teaching a smarter learner. Journal of Computer and System Sciences, 52(2):255-267.

N. Hyams. 1986. Language Acquisition and the Theory of Parameters. D. Reidel.

M. Kearns and G. Valiant. 1989. Cryptographic limitations on learning boolean formulae and finite automata. In 21st Annual ACM Symposium on Theory of Computing, pages 433-444, New York. ACM.

M. J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. 1994. On the learnability of discrete distributions. In Proc. of the 25th Annual ACM Symposium on Theory of Computing, pages 273-282.

W. Labov. 1975. Empirical foundations of linguistic theory. In R. Austerlitz, editor, The Scope of American Linguistics. Peter de Ridder Press.

G. F. Marcus. 1993. Negative evidence in language acquisition. Cognition, 46:53-85.

D. Ron, Y. Singer, and N. Tishby. 1995. On the learnability and usage of acyclic probabilistic finite automata. In COLT 1995, pages 31-40, Santa Cruz, CA, USA. ACM.

L. Valiant. 1984. A theory of the learnable. Communications of the ACM, 27(11):1134-1142.

K. Vijay-Shanker and David J. Weir. 1994. The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27(6):511-546.

Kenneth Wexler and Peter W. Culicover. 1980. Formal Principles of Language Acquisition. MIT Press.

C. Yang. 2002. Knowledge and Learning in Natural Language. Oxford University Press.
