Poverty of the Stimulus? A Rational Approach

Amy Perfors1 ([email protected]), Joshua B. Tenenbaum1 ([email protected]), and Terry Regier2 ([email protected])
1Department of Brain and Cognitive Sciences, MIT; 2Department of Psychology, University of Chicago

Abstract

The Poverty of the Stimulus (PoS) argument holds that children do not receive enough evidence to infer the existence of core aspects of language, such as the dependence of linguistic rules on hierarchical phrase structure. We reevaluate one version of this argument with a Bayesian model of grammar induction, and show that a rational learner without any initial language-specific biases could learn this dependency given typical child-directed input. This choice enables the learner to master aspects of syntax, such as the auxiliary fronting rule in interrogative formation, even without having heard directly relevant data (e.g., interrogatives containing an auxiliary in a relative clause in the subject NP).

Introduction

Modern linguistics was strongly influenced by Chomsky's observation that language learners make grammatical generalizations that do not appear justified by the evidence in the input (Chomsky, 1965, 1980). The notion that these generalizations can best be explained by innate knowledge, known as the argument from the Poverty of the Stimulus (henceforth PoS), has led to an enduring debate that is central to many of the key issues in cognitive science and linguistics.

The original formulation of the Poverty of the Stimulus argument rests critically on assumptions about simplicity, the nature of the input children are exposed to, and how much evidence is sufficient to support the generalizations that children make. The phenomenon of auxiliary fronting in interrogative sentences is one example of the PoS argument; here, the argument states that children must be innately biased to favor structure-dependent rules, which operate using grammatical constructs like phrases and clauses, over structure-independent rules, which operate only on the sequence of words.

English interrogatives are formed from declaratives by fronting the main clause auxiliary. Given a declarative sentence like "The dog in the corner is hungry", the interrogative is formed by moving the is to make the sentence "Is the dog in the corner hungry?" Chomsky considered two types of operation that can explain auxiliary fronting (Chomsky, 1965, 1971). The simplest (linear) rule is independent of the hierarchical phrase structure of the sentence: take the leftmost (first) occurrence of the auxiliary in the sentence and move it to the beginning. The structure-dependent (hierarchical) rule – move the auxiliary from the main clause of the sentence – is more complex, since it operates over a sentence's phrasal structure and not just its sequence of elements.

The "poverty" part of this form of the PoS argument claims that children do not see the data they would need in order to rule out the structure-independent (linear) hypothesis. An example of such data would be an interrogative sentence such as "Is the man who is hungry ordering dinner?" In this sentence, the main clause auxiliary is fronted in spite of the existence of another auxiliary that would come first in the corresponding declarative sentence. Chomsky argued that this type of data is not accessible in child-directed speech, maintaining that "it is quite possible for a person to go through life without having heard any of the relevant examples that would choose between the two principles" (Chomsky, 1971).
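To make the contrast concrete, the short Python sketch below (ours, not part of the paper) applies both candidate rules to the declarative underlying "Is the man who is hungry ordering dinner?" The token list, the small set of auxiliaries, and the pre-segmented subject phrase are simplifying assumptions standing in for a real parse; only the structure-dependent rule yields the grammatical interrogative.

# Illustrative sketch (not from the paper) contrasting the two candidate rules.
AUXILIARIES = {"is", "can", "will"}

def linear_fronting(tokens):
    # Structure-independent rule: move the leftmost auxiliary to the front.
    i = next(idx for idx, word in enumerate(tokens) if word in AUXILIARIES)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def hierarchical_fronting(subject, aux, predicate):
    # Structure-dependent rule: front the main-clause auxiliary, ignoring any
    # auxiliary embedded inside the subject phrase.
    return [aux] + subject + predicate

declarative = "the man who is hungry is ordering dinner".split()

# The linear rule grabs the auxiliary inside the relative clause:
print(" ".join(linear_fronting(declarative)) + "?")
# -> is the man who hungry is ordering dinner?   (ungrammatical)

# The hierarchical rule operates over the parsed clause structure:
subject = "the man who is hungry".split()        # subject NP containing a relative clause
print(" ".join(hierarchical_fronting(subject, "is", ["ordering", "dinner"])) + "?")
# -> is the man who is hungry ordering dinner?   (grammatical)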
It is generally accepted that children do not appear to go through a period in which they consider the linear hypothesis (Crain and Nakayama, 1987). However, two other aspects of the PoS argument are the topic of much debate. The first concerns what evidence there is in the input and what constitutes "enough" (Pullum and Scholz, 2002; Legate and Yang, 2002). Unfortunately, this approach is inconclusive: while there is some agreement that the critical forms are rare in child-directed speech, they do occur (Legate and Yang, 2002; Pullum and Scholz, 2002). Lacking a clear specification of how a child's language learning mechanism might work, it is difficult to determine whether that input is sufficient.

The second issue concerns the nature of the stimulus: the suggestion is that, regardless of whether there is enough direct syntactic evidence available, there may be sufficient distributional and statistical regularities in language to explain children's behavior (Redington et al., 1998; Lewis and Elman, 2001; Reali and Christiansen, 2004). Most of the work focusing specifically on auxiliary fronting uses connectionist simulations or n-gram models to argue that child-directed language contains enough information to predict the grammatical status of aux-fronted interrogatives (Reali and Christiansen, 2004; Lewis and Elman, 2001).

While both of these approaches are useful, and the research on statistical learning in particular is promising, there are still notable shortcomings. First, the statistical models do not engage with the primary intuition and issue raised by the PoS argument. The intuition is that language has a hierarchical structure: it uses symbolic notions like syntactic categories and phrases that are hierarchically organized within sentences, which are recursively generated by a grammar. The issue is whether knowledge about this structure is learned or innate. An approach that lacks an explicit representation of structure faces two problems in addressing this issue. First, many linguists and cognitive scientists tend to discount these results because they ignore a principal feature of linguistic knowledge, namely that it is based on structured symbolic representations. Second, connectionist networks and n-gram models tend to be difficult to understand analytically. For instance, the models used by Reali and Christiansen (2004) and Lewis and Elman (2001) measure success by whether they predict the next word in a sequence, rather than by examination of an explicit grammar. Though the models perform above chance, it is difficult to tell why, and what precisely they have learned.
In this work we present a Bayesian account of linguistic structure learning in order to engage with the PoS argument on its own terms – taking the existence of structure seriously and asking whether, and to what extent, knowledge of that structure can be inferred by a rational statistical learner. This is an ideal learnability analysis: our question is not whether a learner without innate language-specific biases must infer that linguistic structure is hierarchical, but rather whether it is possible to make that inference. It thus addresses the exact challenge posed by the PoS argument, which holds that such an inference is not possible.

The Bayesian approach makes it possible to combine structured representation with statistical inference, which enables us to achieve a number of important goals. (1) We demonstrate that a learner equipped with the capacity to explicitly represent both hierarchical and linear grammars – but without any initial biases – could infer that the hierarchical grammar is a better fit to typical child-directed input. (2) We show that inferring this hierarchical grammar results in the mastery of aspects of auxiliary fronting, even if no direct evidence is available. (3) Our approach provides a clear and objectively sensible metric of simplicity, as well as a way to explore what sort of data, and how much, is required to make these hierarchical generalizations. And (4) our results suggest that PoS arguments are sensible only when phenomena are considered as part of a linguistic system, rather than taken in isolation.

Method

We formalize the problem of picking the grammar that best fits a corpus of child-directed speech as an instance of Bayesian model selection. The model assumes that linguistic data is generated by first picking a type of grammar T, then selecting as an instance of that type a specific grammar G from which the data D is generated. We compare grammars according to a probabilistic score that combines the prior probability of G and T with the likelihood of the corpus data D given that grammar, in accordance with Bayes' rule:

  p(G, T | D) ∝ p(D | G, T) p(G | T) p(T)

Because this analysis takes place within an ideal learning framework, we assume that the learner is able to effectively search the joint space of G and T for grammars that maximize the Bayesian scoring criterion. We do not focus on the question of whether the learner can successfully search the space, instead presuming that an ideal learner can learn a given (G, T) pair if it has a higher score than the alternatives. Because we only compare grammars that can parse our corpus, we first describe the corpus before explaining the grammars.
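To make the scoring criterion concrete, the following schematic Python sketch compares candidate grammars by their log posterior scores. It is an illustration rather than the paper's implementation: the log_likelihood_fn and log_prior_fn arguments stand in for the computations described under "Scoring the grammars" below, and the uniform prior over the three grammar types is our simplifying assumption.

import math

# Schematic sketch of the model-selection criterion:
#   log p(G, T | D) = log p(D | G, T) + log p(G | T) + log p(T) + constant.
# `log_likelihood_fn` and `log_prior_fn` stand in for the prior and likelihood
# computations described later; the uniform prior over the three grammar types
# (flat, regular, context-free) is an illustrative assumption.

def log_posterior_score(grammar, corpus, log_likelihood_fn, log_prior_fn,
                        num_grammar_types=3):
    log_p_type = -math.log(num_grammar_types)
    return log_likelihood_fn(grammar, corpus) + log_prior_fn(grammar) + log_p_type

def best_grammar(grammars, corpus, log_likelihood_fn, log_prior_fn):
    # Return the (grammar, score) pair with the highest posterior score.
    scored = [(g, log_posterior_score(g, corpus, log_likelihood_fn, log_prior_fn))
              for g in grammars]
    return max(scored, key=lambda pair: pair[1])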
The corpus

The corpus consists of the sentences spoken by adults in the Adam corpus (Brown, 1973) in the CHILDES database (MacWhinney, 2000). In order to focus on grammar learning rather than lexical acquisition, each word is replaced by its syntactic category.[1] Ungrammatical sentences and the most grammatically complex sentence types are removed.[2] The final corpus contains 21792 individual sentence tokens corresponding to 2338 unique sentence types, out of 25876 tokens in the original corpus.[3] Removing the complicated sentence types, done to improve the tractability of the analysis, is if anything a conservative move, since the hierarchical grammar is preferred more strongly as the input grows more complicated.

In order to explore how the preference for a grammar depends on the amount of evidence in the input, we create six smaller corpora as subsets of the main corpus. Under the reasoning that the most frequent sentences are the most available as evidence,[4] each corpus level contains only those sentence forms that occur with a certain frequency in the full corpus. The levels are: Level 1 (all forms occurring 500 or more times, corresponding to 8 unique types); Level 2 (300 times, 13 types); Level 3 (100 times, 37 types); Level 4 (50 times, 67 types); Level 5 (10 times, 268 types); and the complete corpus, Level 6, with 2338 unique types, including interrogatives, wh-questions, relative clauses, prepositional and adverbial phrases, command forms, and auxiliary as well as non-auxiliary verbs.

[1] Parts of speech used included determiners (det), nouns (n), adjectives (adj), comments like "mmhm" (c, sentence fragments only), prepositions (prep), pronouns (pro), proper nouns (prop), infinitives (to), participles (part), infinitive verbs (vinf), conjugated verbs (v), auxiliary verbs (aux), complementizers (comp), and wh-question words (wh). Adverbs and negations were removed from all sentences.
[2] Removed types included topicalized sentences (66 utterances), sentences containing subordinate phrases (845), sentential complements (1636), conjunctions (634), serial verb constructions (459), and ungrammatical sentences (444).
[3] The final corpus contained forms corresponding to 7371 sentence fragments. In order to ensure that the high number of fragments did not affect the results, all analyses were also performed on the corpus with those sentences removed. There was no qualitative change in any of the findings.
[4] Partitioning in this way, by frequency alone, allows us to stratify the input in a principled way; additionally, the higher levels include not only rarer forms but also more complex ones, and thus levels may be thought of as loosely corresponding to complexity.
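The stratification into levels described above amounts to a frequency threshold over sentence types. The sketch below is a hypothetical illustration of that step (the build_levels helper and its input format are ours, not from the paper); it assumes the corpus has already been reduced to sequences of syntactic categories.

from collections import Counter

# Illustrative sketch of the frequency-based stratification into corpus levels.
# `corpus` is assumed to be a list of sentence tokens, each already converted to
# a tuple of syntactic categories, e.g. ("det", "n", "aux", "part").

def build_levels(corpus, thresholds=(500, 300, 100, 50, 10, 1)):
    # Level k keeps the sentence types whose token frequency in the full corpus
    # meets the k-th threshold; the final threshold of 1 yields the full corpus.
    type_counts = Counter(corpus)
    return [{stype for stype, count in type_counts.items() if count >= threshold}
            for threshold in thresholds]

# Example (type counts matching the text):
# levels = build_levels(corpus)
# [len(level) for level in levels]  ->  [8, 13, 37, 67, 268, 2338]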
The grammars

Because this work is motivated by the distinction between rules operating over linear and hierarchical representations, we would like to compare grammars that differ structurally. The hierarchical grammar is context-free, since CFGs generate parse trees with hierarchical structure and are accepted as a reasonable "first approximation" to the grammars of natural language (Chomsky, 1959). We choose two different types of linear (structure-independent) grammars. The first, which we call the flat grammar, is simply a list of each of the sentence types that occur in the corpus; it contains zero non-terminals (aside from S) and 2338 productions, one for each sentence type. Because Chomsky often compared language to a Markov model, we consider a regular grammar as well.

Though the flat and regular grammars may not be of the precise form envisioned by Chomsky, we work with them because they are representative of simple syntactic systems one might define over the linear sequence of words rather than the hierarchical structure of phrases; additionally, it is straightforward to define them in probabilistic terms in order to do Bayesian model selection. All grammars are probabilistic, meaning that each production is associated with a probability and the probability of any given parse is the product of the probabilities of the productions involved in the derivation.

The probabilistic context-free grammar (PCFG) is the most linguistically accurate grammar we could devise that could parse all of the forms in the corpus: as such, it contains the syntactic structures that modern linguists employ, such as noun and verb phrases. The full grammar, used for the Level 6 corpus, contains 14 terminals, 14 non-terminals, and 69 productions. The grammars at the other levels include only the subset of productions and items necessary to parse the corresponding corpus.

The probabilistic regular grammar (PRG) is derived directly from the context-free grammar by converting all productions not already consistent with the regular-grammar formalism (A → a or A → aB). When possible to do so without loss of generalization ability, the resulting productions are simplified and any unused productions are eliminated. The final regular grammar contains 14 terminals, 85 non-terminals, and 390 productions. The number of productions is greater than in the PCFG because each context-free production containing two non-terminals in a row must be expanded into a series of productions (e.g., NP → NP PP expands to NP → pro PP, NP → n PP, etc.). To illustrate this, Table 1 compares NP productions in the context-free and regular grammars.[5]

Context-free grammar
  NP   → NP PP | NP CP | NP C | N | det N | adj N | pro | prop
  N    → n | adj N

Regular grammar
  NP   → pro | prop | n | det N | adj N
       | pro PP | prop PP | n PP | det N_PP | adj N_PP
       | pro CP | prop CP | n CP | det N_CP | adj N_CP
       | pro C  | prop C  | n C  | det N_C  | adj N_C
  N    → n | adj N
  N_PP → n PP | adj N_PP
  N_CP → n CP | adj N_CP
  N_C  → n C  | adj N_C

Table 1: Sample NP productions from the two grammar types.

Scoring the grammars: prior probability

We assume a generative model for creating the grammars under which each grammar is selected from the space of grammars by making a series of choices: first, the grammar type T (flat, regular, or context-free); next, the number of non-terminals, the number of productions, and the number of right-hand-side items each production contains. Finally, for each item, a specific symbol is selected from the set of possible vocabulary (non-terminals and terminals). The prior probability of a grammar with V vocabulary items, n non-terminals, P productions, and N_i symbols in production i is thus given by:[6]

  p(G|T) = p(P)\, p(n) \prod_{i=1}^{P} \Big( p(N_i) \prod_{j=1}^{N_i} \frac{1}{V} \Big)        (1)

Because of the small numbers involved, all calculations are done in the log domain. For simplicity, p(P), p(n), and p(N_i) are all assumed to be geometric distributions with parameter 0.5.[7] Thus, grammars with fewer productions and symbols are given higher prior probability. Notions such as minimum description length and Kolmogorov complexity are also used to capture inductive biases towards simpler grammars (Chater and Vitanyi, 2003; Li and Vitanyi, 1997). We adopt a probabilistic formulation of the simplicity bias because it is efficiently computable, derives in a principled way from a clear generative model, and integrates naturally with how we assess the fit to corpus data, using standard likelihood methods for probabilistic grammars.

[5] The full grammars are available at http://www.mit.edu/~perfors/cogsci06/archive.html.
[6] This probability is calculated in subtly different ways for each grammar type, because of the different constraints each kind of grammar places on the kinds of symbols that can appear in production rules. For instance, with regular grammars, because the first right-hand-side item in each production must be a terminal, the effective value of 1/V when choosing that item is 1/(# terminals). However, for the second right-hand-side item in a regular-grammar production, or for any item in a CFG production, the effective 1/V is 1/(# terminals + # non-terminals), because that item can be either a terminal or a non-terminal. This prior thus slightly favors linear grammars over functionally equivalent context-free grammars.
[7] Qualitative results are similar for other parameter values.
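Equation 1 translates directly into a short computation. The sketch below is our illustrative rendering: it treats the effective vocabulary size V as a single constant, whereas, as footnote 6 notes, the effective value actually depends on the grammar type and the position of the item within a production.

import math

# Illustrative sketch of the log prior in Equation 1. p(P), p(n), and p(N_i) are
# geometric with parameter 0.5, and each right-hand-side symbol is chosen
# uniformly from a vocabulary of size V (here a single constant for simplicity).
# `productions` is a list of right-hand sides, e.g. [["det", "N"], ["adj", "N"]].

def log_geometric(k, p=0.5):
    # log P(X = k) for a geometric distribution over k = 1, 2, 3, ...
    return (k - 1) * math.log(1 - p) + math.log(p)

def log_prior(productions, num_nonterminals, vocab_size):
    score = log_geometric(len(productions))              # number of productions, p(P)
    score += log_geometric(num_nonterminals)             # number of non-terminals, p(n)
    for rhs in productions:
        score += log_geometric(len(rhs))                 # items in this production, p(N_i)
        score += len(rhs) * math.log(1.0 / vocab_size)   # each symbol chosen with prob. 1/V
    return score

# A grammar with fewer, shorter productions receives a higher prior:
small = [["pro"], ["det", "N"]]
large = small + [["adj", "N"], ["det", "N", "PP"], ["pro", "PP"]]
print(log_prior(small, 1, 16) > log_prior(large, 2, 16))   # True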
Scoring the grammars: likelihood

Inspired by Goldwater et al. (2005), the likelihood is calculated assuming a language model that is divided into two components. The first component, the grammar, assigns a probability distribution over the potentially infinite set of syntactic forms that are accepted in the language. The second component generates a finite observed corpus from the infinite set of forms produced by the grammar, and can account for the characteristic power-law distributions found in language (Zipf, 1932). In essence, this two-component model assumes separate generative processes for the allowable types of syntactic forms in a language and for the frequencies of specific sentence tokens.

One advantage of this approach is that grammars are analyzed based on individual sentence types rather than on the frequencies of different sentence forms. This parallels standard linguistic practice: grammar learning is based on how well each grammar accounts for the set of grammatical sentences rather than for their frequency distribution. Since we are concerned with grammar comparison rather than corpus generation, we focus here on the first component of the model.

The likelihood p(D|G, T) reflects how likely it is that the corpus data D was generated by the grammar G. It is calculated as the product of the likelihoods of each sentence type S in the corpus. If the set of sentences is partitioned into k unique types, the log likelihood is given by:

  \log p(D|G,T) = \sum_{i=1}^{k} \log p(S_i|G,T)        (2)

The probability p(S_i|G, T) of generating any sentence type i is the sum of the probabilities of generating all possible parses of that sentence under the grammar G. The probability of a specific parse is the product of the probability of each production used in its derivation.
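Equation 2 likewise has a direct computational reading, summarized in the sketch below. The sentence_log_prob_fn argument is a hypothetical helper returning log p(S_i|G, T); everything here is illustrative rather than the paper's implementation.

# Illustrative sketch of the log likelihood in Equation 2. The corpus is scored
# over unique sentence types, so each type contributes once regardless of its
# token frequency. `sentence_log_prob_fn` is a hypothetical helper returning
# log p(S_i | G, T), e.g. by summing the probabilities of all parses of S_i
# (the inside algorithm, in the case of a PCFG).

def corpus_log_likelihood(grammar, corpus, sentence_log_prob_fn):
    sentence_types = set(corpus)    # corpus: list of category-sequence tuples
    return sum(sentence_log_prob_fn(grammar, stype) for stype in sentence_types)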

Results

…adding so many new productions that this early cost is never regained. The context-free grammar is more complicated than necessary on the smallest corpus, requiring 17 productions and 7 non-terminals to parse just eight sentences, and thus has the lowest relative prior probability. However, its generalization ability is sufficiently great that additions to the corpus require few additional productions: as a result, it quickly becomes simpler than either of the linear grammars.

What is responsible for the transition from linear to hierarchical grammars? Smaller corpora do not contain elements generated from recursive productions (e.g., nested prepositional phrases, NPs with multiple adjectives, or relative clauses) or multiple sentences using the same phrase in different positions (e.g., a prepositional phrase modifying an NP subject, an NP object, a verb, or an adjective phrase). While a regular grammar must often add an entire new subset of productions to account for them, as is evident in the subset of the grammar shown in Table 1, a PCFG need add few or none. As a consequence, both linear grammars have poorer generalization ability and must add proportionally more productions in order to parse a novel sentence.

Likelihoods

The likelihood scores for each grammar on each corpus are shown in Table 2. It is not surprising that the flat grammar has the highest likelihood score on all six corpora.