On bootstrapping of linguistic features for bootstrapping grammars

Damir Ćavar
University of Zadar
Zadar, Croatia
[email protected]

Abstract

We discuss a cue-based grammar induction approach based on a parallel theory of grammar. Our model is based on the hypotheses of interdependency between linguistic levels (of representation) and the inducibility of specific structural properties at a particular level, with consequences for the induction of structural properties at other linguistic levels. We present the results of three different cue-learning experiments and settings, covering the induction of phonological, morphological, and syntactic properties, and discuss potential consequences for our general grammar induction model.[1]

[1] This article builds on joint work and articles with K. Elghamri, J. Herring, T. Ikuta, P. Rodrigues, G. Schrementi and colleagues at the Institute of Croatian Language and Linguistics and the University of Zadar. The research activities were partially funded by several grants over a couple of years, at Indiana University and from the Croatian Ministry of Science, Education and Sports of the Republic of Croatia.

1 Introduction

We assume that individual linguistic levels of natural languages differ with respect to their formal complexity. In particular, the assumption is that structural properties of linguistic levels like phonology or morphology can be characterized fully by Regular grammars, and if not, at least a large subset can be. Structural properties of natural language syntax, on the other hand, might be characterized by Mildly context-sensitive grammars (Joshi et al., 1991), where at least a large subset could be characterized by Regular and Context-free grammars.[2]

[2] We are abstracting away from concrete linguistic models and theories, and their particular complexity, as discussed e.g. in (Ristad, 1990) or (Tesar and Smolensky, 2000).

Ignoring for the time being extra-linguistic conditions and cues for linguistic properties, and independent of the complexity of specific linguistic levels for particular languages, we assume that specific properties at one particular linguistic level correlate with properties at another level. In natural languages, certain phonological processes might be triggered at morphological boundaries only, e.g. (Chomsky and Halle, 1968), or prosodic properties correlate with syntactic phrase boundaries and semantic properties, e.g. (Inkelas and Zec, 1990). Similarly, lexical properties such as stress patterns and morphological structure tend to be specific to certain word types (e.g. substantives, but not function words), i.e. they correlate with the lexical morpho-syntactic properties used in grammars. Other, more informal correlations discussed in linguistics, which rather lack a formal model or explanation, include for example the relation between morphological richness and the freedom of word order in syntax.

Thus, it seems that specific regularities and grammatical properties at one linguistic level might provide cues for structural properties at another level. We expect such correlations to be language specific, given that languages differ significantly, at least qualitatively at the phonetic, phonological, and morphological levels, and at least quantitatively at the syntactic level.

Thus, in our model of grammar induction we favor the view expressed e.g. in (Frank, 2000) that complex grammars are bootstrapped (or grow) from less complex grammars. On the other hand, the intuition that structural or inherent properties at different linguistic levels correlate, i.e. that they seem to be used as cues in processing and acquisition, might require a parallel model of language learning or grammar induction, as for example suggested in (Jackendoff, 1996) or in the Competition Model (MacWhinney and Bates, 1989).


In general, we start with the observation that natural languages are learnable. In principle, the study of how this might be modeled, and of what the minimal assumptions about the grammar properties and the induction algorithm could be, could start top-down, by assuming maximal knowledge of the target grammar and subsequently eliminating elements that are obviously learnable in an unsupervised way, or that fall out as side-effects. Alternatively, a bottom-up approach could start with the question of how much supervision has to be added to an unsupervised model in order to converge to a concise grammar.

Here we favor the bottom-up approach, and ask how simple properties of grammar can be learned in an unsupervised way, and how cues could be identified that allow for the induction of higher-level properties of the target grammar, or of other linguistic levels, by for example favoring some structural hypotheses over others.

In this article we discuss in detail several experiments on morphological cue induction for lexical classification (Ćavar et al., 2004a) and (Ćavar et al., 2004b), using Vector Space Models for category induction and subsequent rule formation. Furthermore, we discuss structural cohesion measured via entropy-based statistics over distributional properties for unsupervised syntactic structure induction from raw text (Ćavar et al., 2004c), and compare the results with syntactic corpora like the Penn Treebank. We expand these results with recent experiments in the domain of unsupervised induction of phonotactic regularities and phonological structure (Ćavar and Ćavar, 2009), providing cues for morphological structure induction and syntactic phrasing.
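To make the lexical classification step concrete, the following is a minimal sketch of the distributional idea behind vector-space category induction: word types that occur in similar contexts receive similar vectors and can be grouped into candidate categories. The window size, the toy corpus, and all function names are illustrative assumptions, not the actual setup of (Ćavar et al., 2004a; 2004b).

```python
# Minimal sketch of distributional category induction; window size,
# corpus, and names are illustrative assumptions, not the setup of
# Cavar et al. (2004a/b).
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens, window=1):
    """Map each word type to a frequency vector over positional contexts."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[w][(j - i, tokens[j])] += 1  # key by relative position
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

tokens = "the dog sleeps the cat sleeps a dog runs".split()
vecs = context_vectors(tokens)
# 'dog' and 'cat' share contexts, so they score high and can be grouped
# into one candidate lexical category.
print(cosine(vecs["dog"], vecs["cat"]))  # ~0.71
```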
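The cohesion measure for syntactic structure induction can be illustrated with pointwise mutual information over adjacent word pairs: weakly associated neighbors are read as candidate phrase boundaries, while strongly associated ones are grouped together. This is a sketch in the spirit of the mutual-information statistics named in (Ćavar et al., 2004c), not a reimplementation of that system; the corpus and the boundary interpretation are assumptions.

```python
# Sketch of a mutual-information cohesion score over word bigrams;
# low PMI between neighbors marks a candidate phrase boundary.
from collections import Counter
from math import log2

def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        (a, b): log2((c / (n - 1)) / ((unigrams[a] / n) * (unigrams[b] / n)))
        for (a, b), c in bigrams.items()
    }

tokens = "the old man saw the young dog".split()
# Pairs sorted by ascending cohesion; the weakest links are the
# most plausible phrase boundaries.
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda x: x[1]):
    print(pair, round(score, 2))
```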
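For the phonotactic and morphological boundary cues, a Harris-style successor entropy over character strings shows the general mechanism: uncertainty about the next segment drops inside a morpheme and peaks at morpheme boundaries. The toy lexicon below is a hypothetical assumption, and this sketch only approximates the kind of distributional statistics explored in (Ćavar and Ćavar, 2009).

```python
# Successor entropy over a toy lexicon; the entropy peak after the
# stems 'walk'/'talk' cues a candidate morpheme boundary. This is
# illustrative, not the exact 2009 procedure.
from collections import Counter, defaultdict
from math import log2

def successor_entropy(words):
    """For each word prefix, compute the entropy of the next character."""
    successors = defaultdict(Counter)
    for w in words:
        for i in range(len(w)):
            successors[w[:i]][w[i]] += 1
    entropy = {}
    for prefix, counts in successors.items():
        total = sum(counts.values())
        entropy[prefix] = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropy

lexicon = ["walks", "walked", "walking", "talks", "talked", "talking"]
h = successor_entropy(lexicon)
for prefix in ["wal", "walk", "walki"]:
    print(repr(prefix), round(h[prefix], 2))  # low, high, low
```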

References

Damir Ćavar and Małgorzata E. Ćavar. 2009. On the induction of linguistic categories and learning grammars. Paper presented at the 10th Szklarska Poręba Workshop, March.

Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004a. Alignment based induction of morphology grammar and its role for bootstrapping. In Gerhard Jäger, Paola Monachesi, Gerald Penn, and Shuly Wintner, editors, Proceedings of Formal Grammar 2004, pages 47–62, Nancy.

Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004b. On statistical bootstrapping. In William G. Sakas, editor, Proceedings of the First Workshop on Psycho-computational Models of Human Language Acquisition, pages 9–16.

Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004c. Syntactic parsing using mutual information and relative entropy. Midwest Computational Linguistics Colloquium (MCLC), June.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Robert Frank. 2000. From regular to context free to mildly context sensitive tree rewriting systems: The path of child language acquisition. In A. Abeillé and O. Rambow, editors, Tree Adjoining Grammars: Formalisms, Linguistic Analysis and Processing, pages 101–120. CSLI Publications.

Sharon Inkelas and Draga Zec. 1990. The Phonology-Syntax Connection. University of Chicago Press, Chicago.

Ray Jackendoff. 1996. The Architecture of the Language Faculty. Number 28 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.

Aravind Joshi, K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber, and Thomas Wasow, editors, Foundational Issues in Natural Language Processing, pages 31–81. MIT Press, Cambridge, MA.

Brian MacWhinney and Elizabeth Bates. 1989. The Crosslinguistic Study of Sentence Processing. Cambridge University Press, New York.

Eric S. Ristad. 1990. Computational structure of generative phonology and its relation to language comprehension. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pages 235–242. Association for Computational Linguistics.

Bruce Tesar and Paul Smolensky. 2000. Learnability in Optimality Theory. MIT Press, Cambridge, MA.

