EACL 2009

Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

CLAGI 2009

30 March 2009
Megaron Athens International Conference Centre
Athens, Greece

Production and Manufacturing by
TEHNOGRAFIA DIGITAL PRESS
7 Ektoros Street
152 35 Vrilissia
Athens, Greece

© 2009 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

Preface

We are delighted to present you with this volume containing the papers accepted for presentation at the first CLAGI workshop (CLAGI 2009), held in Athens, Greece, on 30 March 2009.

We want to acknowledge the help of the PASCAL 2 network of excellence.

Thanks also to Damir Ćavar for giving an invited talk and to the programme committee for their reviewing and advice.

We are indebted to the general chair of EACL 2009, Alex Lascarides, to the publication chairs, Kemal Oflazer and David Schlangen, and to the Workshop chairs Stephen Clark and Miriam Butt, for all their help. We are also grateful for Thierry Murgue’s assistance when putting together the proceedings.

Wishing you a very enjoyable time at CLAGI 2009!

Menno van Zaanen and Colin de la Higuera CLAGI 2009 Programme Chairs


CLAGI 2009 Program Committee

Program Chairs:

Menno van Zaanen, Tilburg University (The Netherlands)
Colin de la Higuera, University of Saint-Étienne (France)

Program Committee Members:

Pieter Adriaans, University of Amsterdam (The Netherlands)
Srinivas Bangalore, AT&T Labs-Research (USA)
Leonor Becerra-Bonache, Yale University (USA)
Rens Bod, University of Amsterdam (The Netherlands)
Antal van den Bosch, Tilburg University (The Netherlands)
Alexander Clark, Royal Holloway, University of London (UK)
Walter Daelemans, University of Antwerp (Belgium)
Shimon Edelman, Cornell University (USA)
Jeroen Geertzen, University of Cambridge (UK)
Jeffrey Heinz, University of Delaware (USA)
Colin de la Higuera, University of Saint-Étienne (France)
Alfons Juan, Polytechnic University of Valencia (Spain)
František Mráz, Charles University (Czech Republic)
Georgios Petasis, National Centre for Scientific Research (NCSR) "Demokritos" (Greece)
Khalil Sima'an, University of Amsterdam (The Netherlands)
Richard Sproat, University of Illinois at Urbana-Champaign (USA)
Menno van Zaanen, Tilburg University (The Netherlands)
Willem Zuidema, University of Amsterdam (The Netherlands)


Table of Contents

Grammatical Inference and Computational Linguistics Menno van Zaanen and Colin de la Higuera ...... 1

On Bootstrapping of Linguistic Features for Bootstrapping Grammars Damir Ćavar ...... 5

Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction Jeroen Geertzen ...... 7

Experiments Using OSTIA for a Language Production Task Dana Angluin and Leonor Becerra-Bonache ...... 16

GREAT: A Finite-State Machine Translation Toolkit Implementing a Grammatical Inference Approach for Transducer Inference (GIATI) Jorge González and Francisco Casacuberta ...... 24

A Note on Contextual Binary Feature Grammars Alexander Clark, Rémi Eyraud and Amaury Habrard ...... 33

Language Models for Contextual Error Detection and Correction Herman Stehouwer and Menno van Zaanen ...... 41

On Statistical Parsing of French with Supervised and Semi-Supervised Strategies Marie Candito, Benoît Crabbé and Djamé Seddah ...... 49

Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars Franco M. Luque and Gabriel Infante-Lopez ...... 58

Comparing Learners for Boolean partitions: Implications for Morphological Paradigms Katya Pertsova ...... 66


Grammatical Inference and Computational Linguistics

Menno van Zaanen
Tilburg Centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
[email protected]

Colin de la Higuera
University of Saint-Étienne
France
[email protected]

1 Grammatical inference and its links to natural language processing

When dealing with language, (machine) learning can take many different faces, of which the most important are those concerned with learning languages and grammars from data. Questions in this context have been at the intersection of the fields of inductive inference and computational linguistics for the past fifty years. To go back to the pioneering work, Chomsky (1955; 1957) and Solomonoff (1960; 1964) were interested, for very different reasons, in systems or programs that could deduce a language when presented information about it.

Gold (1967; 1978) proposed a little later a unifying paradigm called identification in the limit, and the term grammatical inference seems to have appeared in Horning's PhD thesis (1969).

Out of the scope of linguistics, researchers and engineers dealing with pattern recognition, under the impulsion of Fu (1974; 1975), invented algorithms and studied subclasses of languages and grammars from the point of view of what could or could not be learned.

Researchers in machine learning tackled related problems (the most famous being that of inferring a deterministic finite automaton, given examples and counter-examples of strings). Angluin (1978; 1980; 1981; 1982; 1987) introduced the important setting of active learning, or learning from queries, whereas Pitt and his colleagues (1988; 1989; 1993) gave several complexity-inspired results with which the hardness of the different learning problems was exposed.

Researchers working in more applied areas, such as computational biology, also deal with strings. A number of researchers from that field worked on learning grammars or automata from string data (Brazma and Cerans, 1994; Brazma, 1997; Brazma et al., 1998). Similarly, stemming from computational linguistics, one can point out the work relating language learning with more complex grammatical formalisms (Kanazawa, 1998), the more statistical approaches based on building language models (Goodman, 2001), or the different systems introduced to automatically build grammars from sentences (van Zaanen, 2000; Adriaans and Vervoort, 2002). Surveys of related work in specific fields can also be found (Natarajan, 1991; Kearns and Vazirani, 1994; Sakakibara, 1997; Adriaans and van Zaanen, 2004; de la Higuera, 2005; Wolf, 2006).

2 Meeting points between grammatical inference and natural language processing

Grammatical inference scientists belong to a number of larger communities: machine learning (with special emphasis on inductive inference), computational linguistics, and pattern recognition (within the structural and syntactic sub-group). There is a specific conference called ICGI (International Colloquium on Grammatical Inference) devoted to the subject. These conferences have been held at Alicante (Carrasco and Oncina, 1994), Montpellier (Miclet and de la Higuera, 1996), Ames (Honavar and Slutski, 1998), Lisbon (de Oliveira, 2000), Amsterdam (Adriaans et al., 2002), Athens (Paliouras and Sakakibara, 2004), Tokyo (Sakakibara et al., 2006) and Saint-Malo (Clark et al., 2008). In the proceedings of this event it is possible to find a number of technical papers. Within this context, there has been a growing trend towards problems of language learning in the field of computational linguistics.

The formal objects in common between the two communities are the different types of automata and grammars. Therefore, another meeting point between these communities has been the different workshops, conferences and journals that focus on grammars and automata, for instance, FSMNLP, GRAMMARS, CIAA, ...


3 Goal for the workshop

There has been growing interest over the last few years in learning grammars from natural language text (and structured or semi-structured text). The family of techniques enabling such learning is usually called "grammatical inference" or "grammar induction".

The field of grammatical inference is often subdivided into formal grammatical inference, where researchers aim to prove efficient learnability of classes of grammars, and empirical grammatical inference, where the aim is to learn structure from data. In this case the existence of an underlying grammar is just regarded as a hypothesis and what is sought is to better describe the language through some automatically learned rules.

Both formal and empirical grammatical inference have been linked with (computational) linguistics. Formal learnability of grammars has been used in discussions on how people learn language. Some people mention proofs of (non-)learnability of certain classes of grammars as arguments in the empiricist/nativist discussion. On the more practical side, empirical systems that learn grammars have been applied to natural language. Instead of proving whether classes of grammars can be learnt, the aim here is to provide practical learning systems that automatically introduce structure in language. Example fields where initial research has been done are syntactic parsing, morphological analysis of words, and bilingual modelling (or machine translation).

This workshop, organized at EACL 2009, aimed to explore the state of the art in these topics. In particular, we aimed at bringing formal and empirical grammatical inference researchers closer together with researchers in the field of computational linguistics.

The topics put forward were to cover research on all aspects of grammatical inference in relation to natural language (such as syntax, semantics, morphology, phonology, phonetics), including, but not limited to:

• Automatic grammar engineering, including, for example,
  – parser construction,
  – parameter estimation,
  – smoothing, ...
• Unsupervised parsing
• Language modelling
• Transducers, for instance, for
  – morphology,
  – text to speech,
  – automatic translation,
  – transliteration,
  – spelling correction, ...
• Learning syntax with semantics,
• Unsupervised or semi-supervised learning of linguistic knowledge,
• Learning (classes of) grammars (e.g. subclasses of the Chomsky Hierarchy) from linguistic inputs,
• Comparing learning results in different frameworks (e.g. membership vs. correction queries),
• Learning linguistic structures (e.g. phonological features, lexicon) from the acoustic signal,
• Grammars and finite state machines in machine translation,
• Learning setting of Chomskyan parameters,
• Cognitive aspects of grammar acquisition, covering, among others,
  – developmental trajectories as studied by psycholinguists working with children,
  – characteristics of child-directed speech as they are manifested in corpora such as CHILDES, ...
• (Unsupervised) Computational language acquisition (experimental or observational).

4 The papers

The workshop was glad to have as invited speaker Damir Ćavar, who presented a talk titled: On bootstrapping of linguistic features for bootstrapping grammars.

The papers submitted to the workshop, each reviewed by at least three reviewers, covered a very wide range of problems and techniques. Arranging them into patterns was not a simple task!

There were three papers focussing on transducers:

• Jeroen Geertzen shows in his paper Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction how grammar induction can be used in dialogue act prediction.

• In their paper Experiments Using OSTIA for a Language Production Task, Dana Angluin and Leonor Becerra-Bonache build on previous work to see the transducer learning algorithm OSTIA as capable of translating syntax to semantics.

• In their paper titled GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI), Jorge González and Francisco Casacuberta build on a long history of GIATI learning and try to eliminate some of the limitations of previous work. The learning concerns finite-state transducers from parallel corpora.

Context-free grammars of different types were used for very different tasks:

• Alexander Clark, Rémi Eyraud and Amaury Habrard (A note on contextual binary feature grammars) propose a formal study of a new formalism called "CBFG", describe the relationship of CBFG to other standard formalisms and its appropriateness for modelling natural language.

• In their work titled Language models for contextual error detection and correction, Herman Stehouwer and Menno van Zaanen look at spelling problems as a word prediction problem. The prediction needs a language model which is learnt.

• A formal study of French treebanks is made by Marie-Hélène Candito, Benoît Crabbé and Djamé Seddah in their work: On statistical parsing of French with supervised and semi-supervised strategies.

• Franco M. Luque and Gabriel Infante-Lopez study the learnability of NTS grammars with reference to the Penn treebank in their paper titled Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars.

One paper concentrated on morphology:

• In A comparison of several learners for Boolean partitions: implications for morphological paradigm, Katya Pertsova compares a rote learner to three morphological paradigm learners.

References

P. Adriaans and M. van Zaanen. 2004. Computational grammar induction for linguists. Grammars, 7:57–68.

P. Adriaans and M. Vervoort. 2002. The EMILE 4.1 grammar induction toolbox. In Adriaans et al. (Adriaans et al., 2002), pages 293–295.

P. Adriaans, H. Fernau, and M. van Zaanen, editors. 2002. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '02, volume 2484 of LNAI, Berlin, Heidelberg. Springer-Verlag.

D. Angluin. 1978. On the complexity of minimum inference of regular sets. Information and Control, 39:337–350.

D. Angluin. 1980. Inductive inference of formal languages from positive data. Information and Control, 45:117–135.

D. Angluin. 1981. A note on the number of queries needed to identify regular languages. Information and Control, 51:76–87.

D. Angluin. 1982. Inference of reversible languages. Journal of the Association for Computing Machinery, 29(3):741–765.

D. Angluin. 1987. Queries and concept learning. Machine Learning Journal, 2:319–342.

A. Brazma and K. Cerans. 1994. Efficient learning of regular expressions from good examples. In AII '94: Proceedings of the 4th International Workshop on Analogical and Inductive Inference, pages 76–90. Springer-Verlag.

A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. 1998. Pattern discovery in biosequences. In Honavar and Slutski (Honavar and Slutski, 1998), pages 257–270.

A. Brazma. 1997. Computational learning theory and natural learning systems, volume 4, chapter Efficient learning of regular expressions from approximate examples, pages 351–366. MIT Press.

R. C. Carrasco and J. Oncina, editors. 1994. Grammatical Inference and Applications, Proceedings of ICGI '94, number 862 in LNAI, Berlin, Heidelberg. Springer-Verlag.

N. Chomsky. 1955. The logical structure of linguistic theory. Ph.D. thesis, Massachusetts Institute of Technology.

N. Chomsky. 1957. Syntactic Structures. Mouton.

A. Clark, F. Coste, and L. Miclet, editors. 2008. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '08, volume 5278 of LNCS. Springer-Verlag.

C. de la Higuera. 2005. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348.

A. L. de Oliveira, editor. 2000. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '00, volume 1891 of LNAI, Berlin, Heidelberg. Springer-Verlag.

K. S. Fu and T. L. Booth. 1975. Grammatical inference: Introduction and survey. Part I and II. IEEE Transactions on Syst. Man. and Cybern., 5:59–72 and 409–423.

K. S. Fu. 1974. Syntactic Methods in Pattern Recognition. Academic Press, New York.

E. M. Gold. 1967. Language identification in the limit. Information and Control, 10(5):447–474.

E. M. Gold. 1978. Complexity of automaton identification from given data. Information and Control, 37:302–320.

J. Goodman. 2001. A bit of progress in language modeling. Technical report, Microsoft Research.

V. Honavar and G. Slutski, editors. 1998. Grammatical Inference, Proceedings of ICGI '98, number 1433 in LNAI, Berlin, Heidelberg. Springer-Verlag.

J. J. Horning. 1969. A Study of Grammatical Inference. Ph.D. thesis, Stanford University.

M. Kanazawa. 1998. Learnable Classes of Categorial Grammars. CSLI Publications, Stanford, Ca.

M. J. Kearns and U. Vazirani. 1994. An Introduction to Computational Learning Theory. MIT Press.

L. Miclet and C. de la Higuera, editors. 1996. Proceedings of ICGI '96, number 1147 in LNAI, Berlin, Heidelberg. Springer-Verlag.

B. L. Natarajan. 1991. Machine Learning: a Theoretical Approach. Morgan Kaufmann Pub., San Mateo, CA.

G. Paliouras and Y. Sakakibara, editors. 2004. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '04, volume 3264 of LNAI, Berlin, Heidelberg. Springer-Verlag.

L. Pitt and M. Warmuth. 1988. Reductions among prediction problems: on the difficulty of predicting automata. In 3rd Conference on Structure in Complexity Theory, pages 60–69.

L. Pitt and M. Warmuth. 1993. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the Association for Computing Machinery, 40(1):95–142.

L. Pitt. 1989. Inductive inference, DFA's, and computational complexity. In Analogical and Inductive Inference, number 397 in LNAI, pages 18–44. Springer-Verlag, Berlin, Heidelberg.

Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, and E. Tomita, editors. 2006. Grammatical Inference: Algorithms and Applications, Proceedings of ICGI '06, volume 4201 of LNAI, Berlin, Heidelberg. Springer-Verlag.

Y. Sakakibara. 1997. Recent advances of grammatical inference. Theoretical Computer Science, 185:15–45.

R. Solomonoff. 1960. A preliminary report on a general theory of inductive inference. Technical Report ZTB-138, Zator Company, Cambridge, Mass.

R. Solomonoff. 1964. A formal theory of inductive inference. Information and Control, 7(1):1–22 and 224–254.

M. van Zaanen. 2000. ABL: Alignment-based learning. In Proceedings of COLING 2000, pages 961–967. Morgan Kaufmann.

G. Wolf. 2006. Unifying computing and cognition. Cognition research.

On bootstrapping of linguistic features for bootstrapping grammars

Damir Ćavar
University of Zadar
Zadar, Croatia
[email protected]

Abstract

We discuss a cue-based grammar induction approach based on a parallel theory of grammar. Our model is based on the hypotheses of interdependency between linguistic levels (of representation) and inductability of specific structural properties at a particular level, with consequences for the induction of structural properties at other linguistic levels. We present the results of three different cue-learning experiments and settings, covering the induction of phonological, morphological, and syntactic properties, and discuss potential consequences for our general grammar induction model.[1]

1 Introduction

We assume that individual linguistic levels of natural languages differ with respect to their formal complexity. In particular, the assumption is that structural properties of linguistic levels like phonology or morphology can be characterized fully by Regular grammars, and if not, at least a large subset can. Structural properties of natural language syntax on the other hand might be characterized by Mildly context-free grammars (Joshi et al., 1991), where at least a large subset could be characterized by Regular and Context-free grammars.[2]

Ignoring for the time being extra-linguistic conditions and cues for linguistic properties, and independent of the complexity of specific linguistic levels for particular languages, we assume that specific properties at one particular linguistic level correlate with properties at another level. In natural languages certain phonological processes might be triggered at morphological boundaries only, e.g. (Chomsky and Halle, 1968), or prosodic properties correlate with syntactic phrase boundaries and semantic properties, e.g. (Inkelas and Zec, 1990). Similarly, lexical properties, as for example stress patterns and morphological structure, tend to be specific to certain word types (e.g. substantives, but not function words), i.e. they correlate with the lexical morpho-syntactic properties used in grammars of syntax. Other more informal correlations that are discussed in linguistics, but that rather lack a formal model or explanation, are for example the relation between morphological richness and the freedom of word order in syntax.

Thus, it seems that specific regularities and grammatical properties at one linguistic level might provide cues for structural properties at another level. We expect such correlations to be language specific, given that languages differ qualitatively and significantly at least at the phonetic, phonological and morphological level, and at least quantitatively also at the syntactic level.

Thus in our model of grammar induction, we favor the view expressed e.g. in (Frank, 2000) that complex grammars are bootstrapped (or grow) from less complex grammars. On the other hand, the intuition that structural or inherent properties at different linguistic levels correlate, i.e. that they seem to be used as cues in processing and acquisition, might require a parallel model of language learning or grammar induction, as for example suggested in (Jackendoff, 1996) or the Competition Model (MacWhinney and Bates, 1989).

[1] This article builds on joint work and articles with K. Elghamri, J. Herring, T. Ikuta, P. Rodrigues, G. Schrementi and colleagues at the Institute of Croatian Language and Linguistics and the University of Zadar. The research activities were partially funded by several grants over a couple of years, at Indiana University and from the Croatian Ministry of Science, Education and Sports of the Republic of Croatia.
[2] We are abstracting away from concrete linguistic models and theories, and their particular complexity, as discussed e.g. in (Ristad, 1990) or (Tesar and Smolensky, 2000).


In general, we start with the observation that natural languages are learnable. In principle, the study of how this might be modeled, and what the minimal assumptions about the grammar properties and the induction algorithm could be, could start top-down, by assuming maximal knowledge of the target grammar, and subsequently eliminating elements that are obviously learnable in an unsupervised way, or fall out as side-effects. Alternatively, a bottom-up approach could start with the question about how much supervision has to be added to an unsupervised model in order to converge to a concise grammar.

Here we favor the bottom-up approach, and ask how simple properties of grammar can be learned in an unsupervised way, and how cues could be identified that allow for the induction of higher-level properties of the target grammar, or other linguistic levels, by for example favoring some structural hypotheses over others.

In this article we will discuss in detail several experiments of morphological cue induction for lexical classification (Ćavar et al., 2004a) and (Ćavar et al., 2004b) using Vector Space Models for category induction and subsequent rule formation. Furthermore, we discuss structural cohesion measured via entropy-based statistics on the basis of distributional properties for unsupervised syntactic structure induction (Ćavar et al., 2004c) from raw text, and compare the results with syntactic corpora like the Penn Treebank. We expand these results with recent experiments in the domain of unsupervised induction of phonotactic regularities and phonological structure (Ćavar and Ćavar, 2009), providing cues for morphological structure induction and syntactic phrasing.

References

Damir Ćavar and Małgorzata E. Ćavar. 2009. On the induction of linguistic categories and learning grammars. Paper presented at the 10th Szklarska Poreba Workshop, March.

Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004a. Alignment based induction of morphology grammar and its role for bootstrapping. In Gerhard Jäger, Paola Monachesi, Gerald Penn, and Shuly Wintner, editors, Proceedings of Formal Grammar 2004, pages 47–62, Nancy.

Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004b. On statistical bootstrapping. In William G. Sakas, editor, Proceedings of the First Workshop on Psycho-computational Models of Human Language Acquisition, pages 9–16.

Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi. 2004c. Syntactic parsing using mutual information and relative entropy. Midwest Computational Linguistics Colloquium (MCLC), June.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Robert Frank. 2000. From regular to context free to mildly context sensitive tree rewriting systems: The path of child language acquisition. In A. Abeillé and O. Rambow, editors, Tree Adjoining Grammars: Formalisms, Linguistic Analysis and Processing, pages 101–120. CSLI Publications.

Sharon Inkelas and Draga Zec. 1990. The Phonology-Syntax Connection. University of Chicago Press, Chicago.

Ray Jackendoff. 1996. The Architecture of the Language Faculty. Number 28 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.

Aravind Joshi, K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber, and Thomas Wasow, editors, Foundational Issues in Natural Language Processing, pages 31–81. MIT Press, Cambridge, MA.

Brian MacWhinney and Elizabeth Bates. 1989. The Crosslinguistic Study of Sentence Processing. Cambridge University Press, New York.

Eric S. Ristad. 1990. Computational structure of generative phonology and its relation to language comprehension. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pages 235–242. Association for Computational Linguistics.

Bruce Tesar and Paul Smolensky. 2000. Learnability in Optimality Theory. MIT Press, Cambridge, MA.

Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction

Jeroen Geertzen
Research Centre for English & Applied Linguistics
University of Cambridge, UK
[email protected]

Abstract

This paper presents a model-based approach to dialogue management that is guided by data-driven dialogue act prediction. The statistical prediction is based on stochastic context-free grammars that have been obtained by means of grammatical inference. The prediction performance of the method compares favourably to that of a heuristic baseline and to that of n-gram language models.

The act prediction is explored both for dialogue acts without realised semantic content (consisting only of communicative functions) and for dialogue acts with realised semantic content.

1 Introduction

Dialogue management is the activity of determining how to behave as an interlocutor at a specific moment of time in a conversation: which action can or should be taken at what state of the dialogue. The systematic way in which an interlocutor chooses among the options for continuing a dialogue is often called a dialogue strategy.

Coming up with suitable dialogue management strategies for dialogue systems is not an easy task. Traditional methods typically involve manually crafting and tuning frames or hand-crafted rules, requiring considerable implementation time and cost. More recently, statistical methods are being used to semi-automatically obtain models that can be trained and optimised using dialogue data.[1] These methods are usually based on two assumptions. First, the training data is assumed to be representative of the communication that may be encountered in interaction. Second, it is assumed that dialogue can be modelled as a Markov Decision Process (MDP) (Levin et al., 1998), which implies that dialogue is modelled as a sequential decision task in which each contribution (action) results in a transition from one state to another. The latter assumption allows to assign a reward for action-state pairs, and to determine the dialogue management strategy that results in the maximum expected reward by finding for each state the optimal action by using reinforcement learning (cf. (Sutton and Barto, 1998)). Reinforcement learning approaches to dialogue management have proven to be successful in several task domains (see for example (Paek, 2006; Lemon et al., 2006)). In this process there is no supervision, but what is optimal depends usually on factors that require human action, such as task completion or user satisfaction.

The remainder of this paper describes and evaluates a model-based approach to dialogue management in which the decision process of taking a particular action given a dialogue state is guided by data-driven dialogue act prediction. The approach improves over n-gram language models and can be used in isolation or for user simulation, without yet providing a full alternative to reinforcement learning.

2 Using structural properties of task-oriented dialogue

One of the best known regularities that are observed in dialogue are the two-part structures, known as adjacency pairs (Schegloff, 1968), like QUESTION-ANSWER or GREETING-GREETING.

A simple model of predicting a plausible next dialogue act that deals with such regularities could be based on bigrams, and to include more context also higher-order n-grams could be used. For instance, Stolcke et al. (2000) explore n-gram models based on transcribed words and prosodic information for SWBD-DAMSL dialogue acts in the Switchboard corpus (Godfrey et al., 1992).

[1] See e.g. (Young, 2002) for an overview.


After training back-off n-gram models (Katz, 1987) of different order using frequency smoothing (Witten and Bell, 1991), it was concluded that trigrams and higher-order n-grams offer a small gain in prediction performance with respect to bigrams.

Apart from adjacency pairs, there is a variety of more complex re-occurring interaction patterns. For instance, the following utterances with corresponding dialogue act types illustrate a clarification sub-dialogue within an information-request dialogue:

1  A: How do I do a fax?                  QUESTION
2  B: Do you want to send or print one?   QUESTION
3  A: I want to print it                  ANSWER
4  B: Just press the grey button          ANSWER

Such structures have received considerable attention and their models are often referred to as discourse/dialogue grammars (Polanyi and Scha, 1984) or conversational/dialogue games (Levin and Moore, 1988).

As also remarked by Levin (1999), predicting and recognising dialogue games using n-gram models is not really successful. There are various causes for this. The flat horizontal structure of n-grams does not allow (hierarchical) grouping of symbols. This may weaken the predictive power and reduces the power of the representation since nested structures such as exemplified above cannot be represented in a straightforward way.

A better solution would be to express the structure of dialogue games by a context-free grammar (CFG) representation in which the terminals are dialogue acts and the non-terminals denote conversational games. Construction of a CFG would require explicit specification of a discourse grammar, which could be done by hand, but it would be a great advantage if CFGs could automatically be induced from the data. An additional advantage of grammar induction is the possibility to assess the frequency of typical patterns, and a stochastic context-free grammar (SCFG) may be produced which can be used for parsing the dialogue data.

3 Sequencing dialogue acts

Both n-gram language models and SCFG based models work on sequences of symbols. Using more complex symbols increases data sparsity: encoding more information increases the number of unique symbols in the dataset and decreases the number of reoccurring patterns which could be used in the prediction.

In compiling the symbols for the prediction experiments, three aspects are important: the identification of interlocutors, the definition of dialogue acts, and multifunctionality in dialogue.

The dialogue act taxonomy that is used in the prediction experiments is that of DIT (Bunt, 2000). A dialogue act is defined as a pair consisting of a communicative function (CF) and a semantic content (SC): a = <CF, SC>. The DIT taxonomy distinguishes 11 dimensions of communicative functions, addressing information about the task or domain, feedback, turn management, and other generic aspects of dialogue (Bunt, 2006). There are also functions, called the general-purpose functions, that may occur in any dimension. In quite some cases, particularly when dialogue control is addressed and dimension-specific functions are realised, the SC is empty. General-purpose functions, by contrast, are always used in combination with a realised SC. For example:

utterance            dialogue act function   semantic content
What to do next?     SET-QUESTION            next-step(X)
Press the button.    SET-ANSWER              press(Y) ∧ button(Y)

The SC —if realised— describes objects, properties, and events in the domain of conversation.

In dialogue act prediction while taking multidimensionality into account, a dialogue D can be represented as a sequence of events in which an event is a set of one dialogue act or multiple dialogue acts occurring simultaneously. The information concerning interlocutor and multifunctionality is encoded in a single symbol and denoted by means of an n-tuple. Assuming that at most three functions can occur simultaneously, a 4-tuple is needed[2]: (interlocutor, da1, da2, da3). An example of a bigram of 4-tuples would then look as follows:

(A, , , ) , (B, , , )

[2] Ignoring the half percent of occurrences with four simultaneous functions.
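The bigram idea from Section 2 and the symbol encoding just described can be made concrete in a few lines of code. The sketch below is illustrative only, not the implementation used in the paper; the toy dialogue, the helpers make_symbol and predict_next, and the padding with empty strings are assumptions made for the example.

```python
from collections import Counter, defaultdict

# A symbol bundles a speaker with up to three simultaneous communicative
# functions, mirroring the (interlocutor, da1, da2, da3) 4-tuple above.
def make_symbol(speaker, functions):
    # Sorting makes two symbols with the same speaker and the same set of
    # functions compare equal, as required in the text.
    funcs = tuple(sorted(functions))
    return (speaker,) + funcs + ("",) * (3 - len(funcs))

# Toy dialogue: a sequence of symbols (hypothetical data, not from DIAMOND).
dialogue = [
    make_symbol("A", ["SET-Q"]),
    make_symbol("B", ["PRO-Q"]),
    make_symbol("A", ["PRO-A"]),
    make_symbol("B", ["SET-A"]),
]

# Count bigrams of symbols.
bigrams = defaultdict(Counter)
for prev, nxt in zip(dialogue, dialogue[1:]):
    bigrams[prev][nxt] += 1

def predict_next(symbol):
    """Suggest the most frequent continuation observed after `symbol`."""
    if symbol not in bigrams:
        return None
    return bigrams[symbol].most_common(1)[0][0]

print(predict_next(make_symbol("A", ["SET-Q"])))  # -> ('B', 'PRO-Q', '', '')
```

Higher-order n-grams would condition on longer tuples of preceding symbols in the same way, at the cost of the data sparsity discussed above.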

Two symbols are considered to be identical when the same speaker is involved and when the symbols both address the same functions. To make it easy to determine if two symbols are identical, the order of elements in a tuple is fixed: functions that occur simultaneously are first ordered on importance of dimension, and subsequently on alphabet. The task-related functions are considered the most important, followed by feedback-related functions, followed by any other remaining functions. This raises the question how recognition performance using multifunctional symbols compares against recognition performance using symbols that only encode the primary function.

4 N-gram language models

There exists a significant body of work on the use of language models in relation to dialogue management. Nagata and Morimoto (1994) describe a statistical model of discourse based on trigrams of utterances classified by custom speech act types. They report 39.7% prediction accuracy for the top candidate and 61.7% for the top three candidates.

In the context of the dialogue component of the speech-to-speech translation system VERBMOBIL, Reithinger and Maier (1995) use n-gram dialogue act probabilities to suggest the most likely dialogue act. In later work, Alexandersson and Reithinger (1997) describe an approach which comes close to the work reported in this paper: Using grammar induction, plan operators are semi-automatically derived and combined with a statistical disambiguation component. This system is claimed to have an accuracy score of around 70% on turn management classes.

Another study is that of Poesio and Mikheev (1998), in which prediction based on the previous dialogue act is compared with prediction based on the context of dialogue games. Using the Map Task corpus annotated with 'moves' (dialogue acts) and 'transactions' (games) they showed that by using higher dialogue structures it was possible to perform significantly better than a bigram model approach. Using bigrams, 38.6% accuracy was achieved. Additionally taking game structure into account resulted in 50.6%; adding information about speaker change resulted in an accuracy of 41.8% with bigrams, 54% with game structure.

All studies discussed so far are only concerned with sequences of communicative functions, and disregard the semantic content of dialogue acts.

5 Dialogue grammars

To automatically induce patterns from dialogue data in an unsupervised way, grammatical inference (GI) techniques can be used. GI is a branch of unsupervised machine learning that aims to find structure in symbolic sequential data. In this case, the input of the GI algorithm will be sequences of dialogue acts.

5.1 Dialogue Grammars Inducer

For the induction of structure, a GI algorithm has been implemented that will be referred to as Dialogue Grammars Inducer (DGI). This algorithm is based on distributional clustering and alignment-based learning (van Zaanen and Adriaans, 2001; van Zaanen, 2002; Geertzen and van Zaanen, 2004). Alignment-based learning (ABL) is a symbolic grammar inference framework that has successfully been applied to several unsupervised machine learning tasks in natural language processing. The framework accepts sequences with symbols, aligns them with each other, and compares them to find interchangeable subsequences that mark structure. As a result, the input sequences are augmented with the induced structure.

The DGI algorithm takes as input time series of dialogue acts, and gives as output a set of SCFGs.
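As a rough, simplified illustration of the alignment idea behind ABL and DGI (not the actual implementation; align_hypothesis and the toy sequences are invented for exposition), the snippet below compares two dialogue-act sequences, takes the shared prefix and suffix as common context, and hypothesises that the differing middles are interchangeable constituents.

```python
def align_hypothesis(seq1, seq2):
    """Return the shared prefix, the two differing middles, and the shared suffix.

    This mimics, in a very simplified way, how alignment-based learning clusters
    sub-sequences that occur in the same context.
    """
    # Longest common prefix.
    i = 0
    while i < min(len(seq1), len(seq2)) and seq1[i] == seq2[i]:
        i += 1
    # Longest common suffix that does not overlap the prefix.
    j = 0
    while (j < min(len(seq1), len(seq2)) - i
           and seq1[len(seq1) - 1 - j] == seq2[len(seq2) - 1 - j]):
        j += 1
    prefix = seq1[:i]
    suffix = seq1[len(seq1) - j:]
    return prefix, seq1[i:len(seq1) - j], seq2[i:len(seq2) - j], suffix

s1 = ["A:SET-Q", "B:PRO-Q", "A:PRO-A", "B:SET-A"]
s2 = ["A:SET-Q", "B:PAUSE", "B:RESUME", "B:SET-A"]
prefix, mid1, mid2, suffix = align_hypothesis(s1, s2)
# prefix = ['A:SET-Q'] and suffix = ['B:SET-A'] are the shared context;
# mid1 = ['B:PRO-Q', 'A:PRO-A'] and mid2 = ['B:PAUSE', 'B:RESUME'] are
# hypothesised to be interchangeable constituents.
print(prefix, mid1, mid2, suffix)
```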

The algorithm has five phases:

1. SEGMENTATION: In the first phase of DGI, the time series are —if necessary— segmented in smaller sequences based on a specific time interval in which no communication takes place. This is a necessary step in task-oriented conversation in which there is ample time to discuss (and carry out) several related tasks, and an interaction often consists of a series of short dialogues.

2. ALIGNMENT LEARNING: In the second phase a search space of possible structures, called hypotheses, is generated by comparing all input sequences with each other and by clustering sub-sequences that share similar context. To illustrate the alignment learning, consider the following input sequences:

   A:SET-Q, B:PRO-Q, A:PRO-A, B:SET-A.
   A:SET-Q, B:PAUSE, B:RESUME, B:SET-A.
   A:SET-Q, B:SET-A.

   The alignment learning compares all input sequences with each other, and produces the hypothesised structures depicted below. The induced structure is represented using bracketing.

   [i A:SET-Q, [j B:PRO-Q, A:PRO-A, ]j B:SET-A. ]i
   [i A:SET-Q, [j B:PAUSE, A:RESUME, ]j B:SET-A. ]i
   [i A:SET-Q, [j ]j B:SET-A. ]i

   The hypothesis j is generated because of the similar context (which is underlined). The hypothesis i, the full span, is introduced by default, as it might be possible that the sequence is in itself a part of a longer sequence.

3. SELECTION LEARNING: The set of hypotheses that is generated during alignment learning contains hypotheses that are unlikely to be correct. These hypotheses are filtered out, overlapping hypotheses are eliminated to assure that it is possible to extract a context-free grammar, and the remaining hypotheses are selected and remain in the bracketed output. The decision of which hypotheses to select and which to discard is based on a Viterbi beam search (Viterbi, 1967).

4. EXTRACTION: In the fourth phase, SCFG grammars are extracted from the remaining hypotheses by means of recursive descent parsing. Ignoring the stochastic information, a CFG of the above-mentioned example looks in terms of grammar rules as depicted below:

   S ⇒ A:SET-Q  J  B:SET-A
   J ⇒ B:PRO-Q  A:PRO-A
   J ⇒ B:PAUSE  A:RESUME
   J ⇒ ∅

5. FILTERING: In the last phase, the SCFG grammars that have small coverage or involve many non-terminals are filtered out, and the N remaining SCFG grammars are presented as the output of DGI.

Depending on the mode of working, the DGI algorithm can generate a SCFG covering the complete input or can generate a set of SCFGs. In the former mode, the grammar that is generated can be used for parsing sequences of dialogue acts and by doing so suggests ways to continue the dialogue. In the latter mode, by parsing each grammar in the set of grammars that are expected to represent dialogue games in parallel, specific dialogue games may be recognised, which can in turn be used beneficially in dialogue management.

5.2 A worked example

In testing the algorithm, DGI has been used to infer a set of SCFGs from a development set of 250 utterances of the DIAMOND corpus (see also Section 6.1). Already for this small dataset, DGI produced, using default parameters, 45 'dialogue games'. One of the largest produced structures was the following:

4  S    ⇒ A:SET-Q , NTAX , NTBT , B:SET-A
4  NTAX ⇒ B:PRO-Q , NTFJ
3  NTFJ ⇒ A:PRO-A
1  NTFJ ⇒ A:PRO-A , A:CLARIFY
2  NTBT ⇒ B:PRO-Q , A:PRO-A
2  NTBT ⇒ ∅

In this figure, each CFG rule has a number indicating how many times the rule has been used. One of the dialogue fragments that was used to induce this structure is the following excerpt:

     utterance                               dialogue act
A1   how do I do a short code?               SET-Q
B1   do you want to program one?             PRO-Q
A2   no                                      SET-A
A3   I want to enter a kie* a short code     CLARIFY
B2   you want to use a short code?           PRO-Q
A4   yes                                     PRO-A
B3   press the VK button                     SET-A

Unfortunately, many of the 45 induced structures were very small or involved generalisations already based on only two input samples. To ensure that the grammars produced by DGI generalise better and are less fragmented, a post-processing step has been added which traverses the grammars and eliminates generalisations based on a low number of samples. In practice, this means that the post-processing requires the remaining grammatical structure to be present N times or more in the data.[3] The algorithm without post-processing will be referred to as DGI1; the algorithm with post-processing as DGI2.

[3] N = 2 by default, but may increase with the size of the training data.
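The usage counts attached to the rules above determine the stochastic part of the grammar: normalising each rule's count by the total count of its left-hand side yields rule probabilities. The snippet below is a minimal sketch of that computation using the counts from the worked example; it is an illustration under that assumption, not the DGI implementation.

```python
from collections import Counter

# Rule counts as in the worked example: (lhs, rhs) -> number of times used.
rule_counts = Counter({
    ("S",    ("A:SET-Q", "NTAX", "NTBT", "B:SET-A")): 4,
    ("NTAX", ("B:PRO-Q", "NTFJ")):                    4,
    ("NTFJ", ("A:PRO-A",)):                           3,
    ("NTFJ", ("A:PRO-A", "A:CLARIFY")):               1,
    ("NTBT", ("B:PRO-Q", "A:PRO-A")):                 2,
    ("NTBT", ()):                                     2,   # empty expansion
})

# Turn counts into a stochastic CFG: P(lhs -> rhs) = count / total count of lhs.
lhs_totals = Counter()
for (lhs, _), n in rule_counts.items():
    lhs_totals[lhs] += n

scfg = {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

for (lhs, rhs), p in sorted(scfg.items()):
    print(f"{lhs} -> {' '.join(rhs) or 'ε'}   p = {p:.2f}")
```

For example, NTFJ expands to A:PRO-A with probability 0.75 and to A:PRO-A, A:CLARIFY with probability 0.25 under these counts.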

6 Act prediction experiments

To determine how to behave as an interlocutor at a specific moment of time in a conversation, the DGI algorithm can be used to infer a SCFG that models the structure of the interaction. The SCFG can then be used to suggest a next dialogue act to continue the dialogue. In this section, the performance of the proposed SCFG based dialogue model is compared with the performance of the well-known n-gram language models, both trained on intentional level, i.e. on sequences of sets of dialogue acts.

6.1 Data

The task-oriented dialogues used in the dialogue act prediction tasks were drawn from the DIAMOND corpus (Geertzen et al., 2004), which contains human-machine and human-human Dutch dialogues that have an assistance seeking nature. The dataset used in the experiments contains 1,214 utterances representing 1,592 functional segments from the human-human part of the corpus. In the domain of the DIAMOND data, i.e. operating a fax device, the predicates and arguments in the logical expressions of the SC of the dialogue acts refer to entities, properties, events, and tasks in the application domain. The application domain of the fax device is complex but small: the domain model consists of 70 entities with at most 10 properties, 72 higher-level actions or tasks, and 45 different settings.

Representations of semantic content are often expressed in some form of predicate logic type formula. Examples are Quasi Logical Forms (Alshawi, 1990), Dynamic Predicate Logic (Groenendijk and Stokhof, 1991), and Underspecified Discourse Representation Theory (Reyle, 1993). The SC in the dataset is in a simplified first order logic similar to quasi logical forms, and is suitable to support feasible reasoning, for which also theorem provers, model builders, and model checkers can be used. The following utterances and their corresponding SC characterise the dataset:

1  wat moet ik nu doen?
   (what do I have to do now?)
   λx . next-step(x)

2  druk op een toets
   (press a button)
   λx . press(x) ∧ button(x)

3  druk op de groene toets
   (press the green button)
   λx . press(x) ∧ button(x) ∧ color(x,'green')

4  wat zit er boven de starttoets?
   (what is located above the starttoets?)
   λx . loc-above(x,'button041')

Three types of predicate groups are distinguished: action predicates, element predicates, and property predicates. These types have a fixed order. The action predicates appear before element predicates, which appear in turn before property predicates. This allows to simplify the semantic content for the purpose of reducing data sparsity in act prediction experiments, by stripping away e.g. property predicates. For instance, if desired, the SC of utterance 3 in the example could be simplified to that of utterance 2, making the semantics less detailed but still meaningful.

6.2 Methodology and metrics

Evaluation of overall performance in communication is problematic; there are no generally accepted criteria as to what constitutes an objective and sound way of comparative evaluation. An often-used paradigm for dialogue system evaluation is PARADISE (Walker et al., 2000), in which the performance metric is derived as a weighted combination of subjectively rated user satisfaction, task-success measures and dialogue cost. Evaluating if the predicted dialogue acts are considered as positive contributions in such a way would require the model to be embedded in a fully working dialogue system.

To assess whether the models that are learned produce human-like behaviour without resorting to costly user interaction experiments, it is needed to compare their output with real human responses given in the same contexts. This will be done by deriving a model from one part of a dialogue corpus and applying the model on an 'unseen' part of the corpus, comparing the suggested next dialogue act with the observed next dialogue act. To measure the performance, accuracy is used, which is defined as the proportion of suggested dialogue acts that match the observed dialogue acts.

In addition to the accuracy, also perplexity is used as metric. Perplexity is widely used in relation to speech recognition and language models, and can in this context be understood as a metric that measures the number of equiprobable possible choices that a model faces at a given moment. Perplexity, being related to entropy, is defined as follows:

Entropy = − Σᵢ p(wᵢ | h) · log₂ p(wᵢ | h)
Perplexity = 2^Entropy

where h denotes the conditioned part, i.e. wᵢ₋₁ in the case of bigrams and wᵢ₋₂, wᵢ₋₁ in the case of trigrams, et cetera. In sum, accuracy could be described as a measure of correctness of the hypothesis and perplexity could be described as how probable the correct hypothesis is.
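Read operationally, these definitions amount to the small computation sketched below (an illustrative snippet, not the evaluation code used in the paper), where the model supplies a conditional distribution over possible next symbols given the context h.

```python
import math

def entropy(cond_probs):
    """Entropy of a conditional distribution p(w_i | h) over next symbols."""
    return -sum(p * math.log2(p) for p in cond_probs.values() if p > 0)

def perplexity(cond_probs):
    """Perplexity: the number of equiprobable choices the model faces."""
    return 2 ** entropy(cond_probs)

# Hypothetical distribution over next dialogue acts in some context h.
p_next = {"B:SET-A": 0.5, "B:PRO-Q": 0.25, "B:PAUSE": 0.25}
print(entropy(p_next))     # 1.5 bits
print(perplexity(p_next))  # ~2.83 equiprobable choices
```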

For all n-gram language modelling tasks reported, Good-Turing smoothing was used (Katz, 1987). To reduce the effect of imbalances in the dialogue data, the results were obtained using 5-fold cross-validation.

To have an idea how the performance of both the n-gram language models and the SCFG models relates to the performance of a simple heuristic, a baseline has been computed which suggests a majority class label according to the interlocutor role in the dialogue. The information seeker has SET-Q and the information provider has SET-A as majority class label.

6.3 Results for communicative functions

The scores for communicative function prediction are presented in Table 1. For each of the three kinds of symbols, accuracy and perplexity are calculated: the first two columns are for the main CF, the second two columns are for the combination of speaker identity and main CF, and the third two columns are for the combination of speaker identity and all CFs. The scores for the latter two codings could not be calculated for the 5-gram model, as the data were too sparse.

As was expected, there is an improvement in both accuracy (increasing) and perplexity (decreasing) for increasing n-gram order. After the 4-gram language model, the scores drop again. This could well be the result of insufficient training data, as the more complex symbols could not be predicted well.

Both language models and SCFG models perform better than the baseline, for all three groups. The two SCFG models, DGI1 and DGI2, clearly outperform the n-gram language models with a substantial difference in accuracy. Also the perplexity tends to be lower. Furthermore, model DGI2 performs clearly better than model DGI1, which indicates that the 'flattening' of non-terminals which is described in Section 5 results in better inductions.

When comparing the three groups of sequences, it can be concluded that providing the speaker identity combined with the main communicative function results in better accuracy scores of 5.9% on average, despite the increase in data sparsity. A similar effect has also been reported by Stolcke et al. (2000).

Only for the 5-gram language model, the data become too sparse to learn a language model from reliably. There is again an increase in performance when also the last two positions in the 4-tuple are used and all available dialogue act assignments are available. It should be noted, however, that this increase has less impact than adding the speaker identity. The best performing n-gram language model achieved 66.4% accuracy; the best SCFG model achieved 78.9% accuracy.

6.4 Results for dialogue acts

The scores for prediction of dialogue acts, including SC, are presented in the left part of Table 2. The presentation is similar to Table 1: for each of the three kinds of symbols, accuracy and perplexity were calculated. For dialogue acts that may include semantic content, computing a useful baseline is not obvious. The same baseline as for communicative functions was used, which results in lower scores.

The table shows that the attempts to learn to predict additionally the semantic content of utterances quickly run into data sparsity problems. It turned out to be impossible to make predictions from 4-grams and 5-grams, and for 3-grams the combination of speaker and all dialogue acts could not be computed. Training the SCFGs, by contrast, resulted in fewer problems with data sparsity, as the models abstract quickly.

As with predicting communicative functions, the SCFG models show better performance than the n-gram language models, for which the 2-grams show slightly better results than the 3-grams. Where there was a notable performance difference between DGI1 and DGI2 for CF prediction, for dialogue act prediction there is only a very little difference, which is insignificant considering the relatively high standard deviation. This small difference is explained by the fact that DGI2 becomes less effective as the size of the training data decreases.

As with CF prediction, it can be concluded that providing the speaker identity with the main dialogue act results in better scores, but the difference is less big than observed with CF prediction due to the increased data sparsity.
12 Table 1: Communicative function prediction scores for n-gram language models and SCFGs in accuracy (acc, in percent) and perplexity (pp). CFmain denotes the main communicative function, SPK speaker identity, and CFall all occurring communicative functions.

CFmain SPK + CFmain SPK + CFall acc pp acc pp acc pp baseline 39.1±0.23 24.2±0.19 44.6±0.92 22.0±0.25 42.9±1.33 23.7±0.41

2-gram 53.1±0.88 17.9±0.35 58.3±1.84 16.8±0.31 61.1±1.65 16.3±0.59 3-gram 58.6±0.85 17.1±0.47 63.0±1.98 14.5±0.26 65.9±1.92 14.0±0.23 4-gram 60.9±1.12 16.7±0.15 65.4±1.62 15.2±1.07 66.4±2.03 14.2±0.44 5-gram 60.3±0.43 18.6±0.21 - - - -

DGI1 67.4±3.05 18.3±1.28 74.6±1.94 14.8±1.47 76.5±2.13 13.9±0.35 DGI2 71.8±2.67 16.1±1.25 78.3±2.50 14.0±2.39 78.9±1.98 13.6±0.35

Table 2: Dialogue act prediction scores for n-gram language models and SCFGs. DAmain denotes the dialogue act with the main communicative function, and DAall all occurring dialogue acts.

DAmain SPK + DAmain SPK + DAall fullSC simplifiedSC acc pp acc pp acc pp acc pp baseline 18.5±2.01 31.0±1.64 19.3±1.79 27.6±0.93 18.2±1.93 31.6±1.38 18.2±1.93 31.6±1.38

2-gram 31.2±1.42 28.5±1.03 34.6±1.51 24.7±0.62 35.1±1.30 26.9±0.47 37.5±1.34 26.2±2.37 3-gram 29.0±1.14 34.7±2.82 31.9±1.21 30.5±2.06 - - 29.1±1.28 28.0±2.59 4-gram ------5-gram ------

DGI1 38.8±3.27 25.1±0.94 42.5±0.96 25.0±1.14 42.9±2.44 27.3±1.98 46.6±2.01 24.6±2.24 DGI2 39.2±2.45 25.0±1.28 42.7±1.03 25.3±0.99 42.4±2.19 28.0±1.57 46.4±1.94 24.7±2.55

The prediction scores of dialogue acts with full semantic content and simplified semantic content are presented in the right part of Table 2. For both cases multifunctionality is taken into account by including all occurring communicative functions in each symbol. As can be seen from the table, the simplification of the semantic content leads to improvements in the prediction performance for both types of model. The best n-gram language model improved with 2.4% accuracy from 35.1% to 37.5%; the best SCFG-based model improved with 3.7% from 42.9% to 46.6%.

Moreover, the simplification of the semantic content reduced the problem of data-sparsity, making it also possible to predict based on 3-grams although the accuracy is considerably lower than that of the 2-gram model.

7 Discussion

Both n-gram language models and SCFG based models have their strengths and weaknesses. n-gram models have the advantage of being very robust and they can be easily trained. The SCFG based model can capture regularities that have gaps, and allow to model long(er) distance relations. Both algorithms work on sequences and hence are easily susceptible to data-sparsity when the symbols in the sequences get more complex. The SCFG approach, though, has the advantage that symbols can be clustered in the non-terminals of the grammar, which allows more flexibility.

The multidimensional nature of the DIT++ functions can be adequately encoded in the symbols of the sequences. Keeping track of the interlocutor and including not only the main communicative function but also other functions that occur simultaneously results in better performance even though it decreases the amount of data to learn from.
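The semantic-content simplification evaluated above — stripping property predicates such as color(x,'green') while keeping action and element predicates — can be pictured as a small filter over the conjuncts of the SC formula. The sketch below is purely illustrative: the string representation and the predicate inventory are invented, not those of the DIAMOND data.

```python
# Hypothetical predicate-type lookup; in the paper the three groups are
# action, element, and property predicates, in that fixed order.
PROPERTY_PREDICATES = {"color", "size"}  # invented inventory

def simplify_sc(conjuncts):
    """Drop property predicates from a conjunction of predicates."""
    return [c for c in conjuncts if c.split("(")[0] not in PROPERTY_PREDICATES]

sc_full = ["press(x)", "button(x)", "color(x,'green')"]
print(" ∧ ".join(simplify_sc(sc_full)))  # press(x) ∧ button(x)
```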

13 problematic. For instance, in a heated discussion, cluded that the algorithm outperforms the n-gram the turn management function could be considered models: on the task of predicting the communica- more important for the dialogue than a simultane- tive functions, the best performing n-gram model ously occurring domain specific function. In other achieved 66.4% accuracy; the best SCFG model cases, it is impossible to clearly identify a main achieved 78.9% accuracy. Predicting the seman- function as all functions occurring simultaneously tic content in combination with the communica- are equally important to the dialogue. tive functions is difficult, as evidenced by moder- In general, n-grams of a higher order have a ate scores. Obtaining lower degree n-gram lan- higher predictability and therefore a lower per- guage models is feasible, whereas higher degree plexity. However, using high order n-grams is models are not trainable. Prediction works better problematic due to sparsity of training data, which with the SCFG models, but does not result in con- clearly is the case with 4-grams, and particularly vincing scores. As the corpus is small, it is ex- with 5-grams in combination with complex sym- pected that with scaling up the available training bols as for CF prediction. data, scores will improve for both tasks. Considerably more difficult is the prediction of Future work in this direction can go in sev- dialogue acts with realised semantic content, as eral directions. First, the grammar induction ap- is evidenced in the differences in accuracy and proach shows potential of learning dialogue game- perplexity for all models. Considering that the like structures unsupervised. The performance on data set, with about 1, 600 functional segments, this task could be tested and measured by applying is rather small, the statistical prediction of logical the algorithm on corpus data that have been anno- expressions increases data sparsity to such a de- tated with dialogue games. Second, the models gree that from the n-gram language models, only could also be extended to incorporate more infor- 2-gram (and 3-grams to some extent) could be mation than dialogue acts alone. This could make trained. The SCFG models can be trained for both comparisons with the performance obtained with CF prediction and dialogue act prediction. reinforcement learning or with Bayesian networks As noted in Section 6.2, objective evaluation of interesting. Third, it would be interesting to learn dialogue strategies and behaviour is difficult. The and apply the same models on other kinds of con- evaluation approach used here compares the sug- versation, such as dialogue with more than two in- gested next dialogue act with the next dialogue act terlocutors. Fourth, datasets could be drawn from as observed. This is done for each dialogue act in a large corpus that covers dialogues on a small the test set. This evaluation approach has the ad- but complex domain. This makes it possible to vantage that the evaluation metric can easily be un- evaluate according to the possible continuations derstood and computed. The approach, however, as found in the data for situations with similar di- is also very strict: in a given dialogue context, con- alogue context, rather than to evaluate according tinuations with various types of dialogue acts may to a single possible continuation. Last, the rather all be equally appropriate. 
8 Conclusions and future work

An approach to the prediction of communicative functions and dialogue acts has been presented that makes use of grammatical inference to automatically extract structure from corpus data. The algorithm, based on alignment-based learning, has been tested against a baseline and several n-gram language models. From the results it can be concluded that the algorithm outperforms the n-gram models: on the task of predicting the communicative functions, the best performing n-gram model achieved 66.4% accuracy; the best SCFG model achieved 78.9% accuracy. Predicting the semantic content in combination with the communicative functions is difficult, as evidenced by moderate scores. Obtaining lower-degree n-gram language models is feasible, whereas higher-degree models are not trainable. Prediction works better with the SCFG models, but does not result in convincing scores. As the corpus is small, it is expected that with scaling up the available training data, scores will improve for both tasks.

Future work can go in several directions. First, the grammar induction approach shows potential for learning dialogue game-like structures unsupervised. The performance on this task could be tested and measured by applying the algorithm to corpus data that have been annotated with dialogue games. Second, the models could also be extended to incorporate more information than dialogue acts alone. This could make comparisons with the performance obtained with reinforcement learning or with Bayesian networks interesting. Third, it would be interesting to learn and apply the same models on other kinds of conversation, such as dialogue with more than two interlocutors. Fourth, datasets could be drawn from a large corpus that covers dialogues on a small but complex domain. This makes it possible to evaluate according to the possible continuations as found in the data for situations with similar dialogue context, rather than to evaluate according to a single possible continuation. Last, the rather unexplored parameter space of the DGI algorithm might be worth exploring in optimising the system's performance.

References

Jan Alexandersson and Norbert Reithinger. 1997. Learning dialogue structures from a corpus. In Proceedings of Eurospeech 1997, pages 2231–2234, Rhodes, Greece, September.

Hiyan Alshawi. 1990. Resolving quasi logical forms. Computational Linguistics, 16(3):133–144.

Harry Bunt. 2000. Dialogue pragmatics and context specification. In Harry Bunt and William Black, editors, Abduction, Belief and Context in Dialogue; Studies in Computational Pragmatics, pages 81–150. John Benjamins, Amsterdam, The Netherlands.

Harry Bunt. 2006. Dimensions in dialogue annotation. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 1444–1449, Genova, Italy, May.

Jeroen Geertzen and Menno M. van Zaanen. 2004. Grammatical inference using suffix trees. In Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), pages 163–174, Athens, Greece, October.

Jeroen Geertzen, Yann Girard, and Roser Morante. 2004. The DIAMOND project. Poster at the 8th Workshop on the Semantics and Pragmatics of Dialogue (CATALOG 2004), Barcelona, Spain, July.

John Godfrey, Edward Holliman, and Jane McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the ICASSP-92, pages 517–520, San Francisco, USA.

Jeroen Groenendijk and Martin Stokhof. 1991. Dynamic predicate logic. Linguistics and Philosophy, 14(1):39–100.

Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.

Oliver Lemon, Kallirroi Georgila, and James Henderson. 2006. Evaluating effectiveness and portability of reinforcement learned dialogue strategies with real users: The TALK TownInfo evaluation. In Spoken Language Technology Workshop, pages 178–181.

Joan A. Levin and Johanna A. Moore. 1988. Dialogue-games: metacommunication structures for natural language interaction. Distributed Artificial Intelligence, pages 385–397.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 1998. Using Markov decision process for learning dialogue strategies. In Proceedings of the ICASSP'98, pages 201–204, Seattle, WA, USA.

Lori Levin, Klaus Ries, Ann Thymé-Gobbel, and Alon Lavie. 1999. Tagging of speech acts and dialogue games in Spanish Call Home. In Proceedings of the ACL-99 Workshop on Discourse Tagging, College Park, MD, USA.

Masaaki Nagata and Tsuyoshi Morimoto. 1994. First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance. Speech Communication, 15(3-4):193–203.

Tim Paek. 2006. Reinforcement learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment. In Interspeech Workshop on "Dialogue on Dialogues".

Massimo Poesio and Andrei Mikheev. 1998. The predictive power of game structure in dialogue act recognition: Experimental results using maximum entropy estimation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP-98), Sydney, Australia, December.

Livia Polanyi and Remko Scha. 1984. A syntactic approach to discourse semantics. In Proceedings of the 10th International Conference on Computational Linguistics, pages 413–419, Stanford, CA, USA.

Norbert Reithinger and Elisabeth Maier. 1995. Utilizing statistical dialogue act processing in VERBMOBIL. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 116–121, Cambridge, Massachusetts. Association for Computational Linguistics.

Uwe Reyle. 1993. Dealing with ambiguities by underspecification: Construction, representation and deduction. Journal of Semantics, 10(2):123–179.

Emanuel A. Schegloff. 1968. Sequencing in conversational openings. American Anthropologist, 70:1075–1095.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, March.

Menno van Zaanen and Pieter W. Adriaans. 2001. Comparing two unsupervised grammar induction systems: Alignment-Based Learning vs. EMILE. Technical Report TR2001.05, University of Leeds, Leeds, UK, March.

Menno M. van Zaanen. 2002. Bootstrapping Structure into Language: Alignment-Based Learning. Ph.D. thesis, University of Leeds, Leeds, UK, January.

Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, April.

Marilyn Walker, Candace Kamm, and Diane Litman. 2000. Towards developing general models of usability with PARADISE. Natural Language Engineering, 6(3-4):363–377.

Ian H. Witten and Timothy C. Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Steve Young. 2002. The statistical approach to the design of spoken dialogue systems. Technical Report CUED/F-INFENG/TR.433, Engineering Department, Cambridge University, UK, September.

Experiments Using OSTIA for a Language Production Task

Dana Angluin and Leonor Becerra-Bonache Department of Computer Science, Yale University P.O.Box 208285, New Haven, CT, USA {dana.angluin, leonor.becerra-bonache}@yale.edu

Abstract tions we define sequential transducers to produce pairs consisting of an utterance The phenomenon of meaning-preserving in that language and its meaning. Using corrections given by an adult to a child this data we empirically explore the prop- involves several aspects: (1) the child erties of the OSTIA and DD-OSTIA al- produces an incorrect utterance, which gorithms for the tasks of learning compre- the adult nevertheless understands, (2) the hension and production in this domain, to adult produces a correct utterance with the assess whether they may provide a basis same meaning and (3) the child recognizes for a model of meaning-preserving correc- the adult utterance as having the same tions. meaning as its previous utterance, and takes that as a signal that its previous ut- 1 Introduction terance is not correct according to the adult The role of corrections in language learning has grammar. An adequate model of this phe- recently received substantial attention in Gram- nomenon must incorporate utterances and matical Inference. The kinds of corrections con- meanings, account for how the child and sidered are mainly syntactic corrections based on adult can understand each other’s mean- proximity between strings. For example, a cor- ings, and model how meaning-preserving rection of a string may be given by using edit corrections interact with the child’s in- distance (Becerra-Bonache et al., 2007; Kinber, creasing mastery of language production. 2008) or based on the shortest extension of the In this paper we are concerned with how queried string (Becerra-Bonache et al., 2006), a learner who has learned to comprehend among others. In these approaches semantic in- utterances might go about learning to pro- formation is not used. duce them. However, in natural situations, a child’s er- We consider a model of language com- roneous utterances are corrected by her parents prehension and production based on finite based on the meaning that the child intends to ex- sequential and subsequential transducers. press; typically, the adult’s corrections preserve Utterances are modeled as finite sequences the intended meaning of the child. Adults use cor- of words and meanings as finite sequences rections in part as a way of making sure they have of predicates. Comprehension is inter- understood the child’s intentions, in order to keep preted as a mapping of utterances to mean- the conversation “on track”. Thus the child’s ut- ings and production as a mapping of mean- terance and the adult’s correction have the same ings to utterances. Previous work (Castel- meaning, but the form is different. As Chouinard lanos et al., 1993; Pieraccini et al., 1993) and Clark point out (2003), because children at- has applied subsequential transducers and tend to contrasts in form, any change in form that the OSTIA algorithm to the problem of does not mark a different meaning will signal to learning to comprehend language; here we children that they may have produced something apply them to the problem of learning to that is not acceptable in the target language. Re- produce language. For ten natural lan- sults in (Chouinard and Clark, 2003) show that guages and a limited domain of geomet- adults reformulate erroneous child utterances of- ric shapes and their properties and rela- ten enough to help learning. Moreover, these re-

Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 16–23, Athens, Greece, 30 March 2009. c 2009 Association for Computational Linguistics

16 sults show that children can not only detect differ- For example, from (gr, tr, le, sq) the child may ences between their own utterance and the adult produce green triangle left square. Our goal is to reformulation, but that they do make use of that model how the learner might move from this tele- information. graphic speech to speech that is grammatical in the Thus in some natural situations, corrections adult’s sense. Moreover, we would like to find a have a semantic component that has not been taken formal framework in which corrections (in form of into account in previous Grammatical Inference expansions, for example, the green triangle to the studies. Some interesting questions arise: What left of the square) can be given to the child dur- are the effects of corrections on learning syntax? ing the intermediate stages (before the learner is Can corrections facilitate the language learning able to produce grammatically correct utterances) process? One of our long-term goals is to find a to study their effect on language learning. formal model that gives an account of this kind We thus propose to model the problem of of correction and in which we can address these child language production as a machine trans- questions. Moreover, such a model might allow us lation problem, that is, as the task of translat- to show that semantic information can simplify the ing a sequence of predicate symbols (representing problem of learning formal languages. the meaning of an utterance) into a correspond- A simple computational model of semantics and ing utterance in a natural language. In this pa- context for language learning incorporating se- per we explore the possibility of applying existing mantics was proposed in (Angluin and Becerra- automata-theoretic approaches to machine transla- Bonache, 2008). This model accommodates two tion to model language production. In Section 2, different tasks: comprehension and production. we describe the use of subsequential transducers That paper focused only on the comprehension for machine translation tasks and review the OS- task and formulated the learning problem as fol- TIA algorithm to learn them (Oncina, 1991). In lows. The teacher provides to the learner several Section 3, we present our model of how the learner example pairs consisting of a situation and an ut- can move from telegraphic to adult speech. In Sec- terance denoting something in the situation; the tion 4, we present the results of experiments in the goal of the learner is to learn the meaning func- model made using OSTIA. Discussion of these re- tion, allowing the learner to comprehend novel ut- sults is presented in Section 5 and ideas for future terances. The results in that paper show that under work are in Section 6. certain assumptions, a simple algorithm can learn to comprehend an adult’s utterance in the sense of 2 Learning Subsequential Transducers producing the same sequence of predicates, even Subsequential transducers (SSTs) are a formal without mastering the adult’s grammar. For exam- model of translation widely studied in the liter- ple, receiving the utterance the blue square above ature. SSTs are deterministic finite state mod- the circle, the learner would be able to produce the els that allow input-output mappings between lan- sequence of predicates (bl, sq, ab, ci). guages. Each edge of an SST has an associated In this paper we focus on the production task, input symbol and output string. 
When an in- using sequential and subsequential transducers to put string is accepted, an SST produces an out- model both comprehension and production. Adult put string that consists of concatenating the out- production can be modeled as converting a se- put substrings associated with sequence of edges quence of predicates into an utterance, which can traversed, together with the substring associated be done with access to the meaning transducer for with the last state reached by the input string. Sev- the adult’s language. eral phenomena in natural languages can be eas- However, we do not assume that the child ini- ily represented by means of SSTs, for example, tially has access to the meaning transducer for the different orders of noun and adjective in Span- the adult’s language; instead we assume that the ish and English (e.g., un cuadrado rojo - a red child’s production progresses through different square). Formal and detailed definitions can be stages. Initially, child production is modeled as found in (Berstel, 1979). consisting of two different tasks: finding a correct For any SST it is always possible to find an sequence of predicates, and inverting the meaning equivalent SST that has the output strings assigned function to produce a kind of “telegraphic speech”. to the edges and states so that they are as close to

the initial state as they can be. This is called an Onward Subsequential Transducer (OST).

It has been proved that SSTs are learnable in the limit from a positive presentation of sentence pairs by an efficient algorithm called OSTIA (Onward Subsequential Transducer Inference Algorithm) (Oncina, 1991). OSTIA takes as input a finite training set of input-output pairs of sentences, and produces as output an OST that generalizes the training pairs. The algorithm proceeds as follows (this description is based on (Oncina, 1998)):

• A prefix tree representation of all the input sentences of the training set is built. Empty strings are assigned as output strings to both the internal nodes and the edges of this tree, and every output sentence of the training set is assigned to the corresponding leaf of the tree. The result is called a tree subsequential transducer.

• An onward tree subsequential transducer equivalent to the tree subsequential transducer is constructed by moving the longest common prefixes of the output strings, level by level, from the leaves of the tree towards the root.

• Starting from the root, all pairs of states of the onward tree subsequential transducer are considered in order, level by level, and are merged if possible (i.e., if the resulting transducer is subsequential and does not contradict any pair in the training set).

SSTs and OSTIA have been successfully applied to different translation tasks: Roman numerals to their decimal representations, numbers written in English to their Spanish spelling (Oncina, 1991), and Spanish sentences describing simple visual scenes to corresponding English and German sentences (Castellanos et al., 1994). They have also been applied to language understanding tasks (Castellanos et al., 1993; Pieraccini et al., 1993).

Moreover, several extensions of OSTIA have been introduced. For example, OSTIA-DR incorporates domain (input) and range (output) models in the learning process, allowing the algorithm to learn SSTs that accept only sentences compatible with the input model and produce only sentences compatible with the output model (Oncina and Varo, 1996). Experiments with a language understanding task gave better results with OSTIA-DR than with OSTIA (Castellanos et al., 1993). Another extension is DD-OSTIA (Oncina, 1998), which, instead of considering a lexicographic order to merge states, uses a heuristic order based on a measure of the equivalence of the states. Experiments in (Oncina, 1998) show that better results can be obtained by using DD-OSTIA in certain translation tasks from Spanish to English.

3 From telegraphic to adult speech

To model how the learner can move from telegraphic speech to adult speech, we reduce this problem to a translation problem, in which the learner has to learn a mapping from sequences of predicates to utterances. As we have seen in the previous section, SSTs are an interesting approach to machine translation. Therefore, we explore the possibility of modeling language production using SSTs and OSTIA, to see whether they may provide a good framework to model corrections.

As described in (Angluin and Becerra-Bonache, 2008), after learning the meaning function the learner is able to assign correct meanings to utterances, and therefore, given a situation and an utterance that denotes something in the situation, the learner is able to point correctly to the object denoted by the utterance. To simplify the task we consider, we make two assumptions about the learner at the start of the production phase: (1) the learner's lexicon represents a correct meaning function, and (2) the learner can generate correct sequences of predicates.

Therefore, in the initial stage of the production phase, the learner is able to produce a kind of "telegraphic speech" by inverting the lexicon constructed during the comprehension stage. For example, if the sequence of predicates is (bl, sq, ler, ci), and in the lexicon blue is mapped to bl, square to sq, right to ler and circle to ci, then by inverting this mapping the learner would produce blue square right circle.
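As an illustration, inverting such a lexicon is a simple operation. The following sketch is not the authors' code; the four-entry lexicon is just the toy fragment from the example above, and a real lexicon would be learned from situated utterances.

```python
# Toy fragment of a word-to-predicate lexicon learned in the comprehension stage.
lexicon = {"blue": "bl", "square": "sq", "right": "ler", "circle": "ci"}

# Invert the mapping: predicate -> word (assumes one word per predicate here).
inverse_lexicon = {pred: word for word, pred in lexicon.items()}

def telegraphic(predicates):
    """Map each predicate to a word; function words are simply never produced."""
    return " ".join(inverse_lexicon[p] for p in predicates)

print(telegraphic(["bl", "sq", "ler", "ci"]))  # -> "blue square right circle"
```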
For ex- 1991) and Spanish sentences describing simple ample, if the sequence of predicates is (bl, sq, ler, visual scenes to corresponding English and Ger- ci), and in the lexicon blue is mapped to bl, square man sentences (Castellanos et al., 1994). They to sq, right to ler and circle to ci, then by invert- have also been applied to language understanding ing this mapping, the learner would produce blue tasks (Castellanos et al., 1993; Pieraccini et al., square right circle. 1993). In order to explore the capability of SSTs and Moreover, several extensions of OSTIA have OSTIA to model the next stage of language pro- been introduced. For example, OSTIA-DR incor- duction (from telegraphic to adult speech), we take porates domain (input) and range (output) mod- the training set to be input-output pairs each of els in the learning process, allowing the algorithm which contains as input a sequence of predicates to learn SSTs that accept only sentences compat- (e.g., (bl, sq, ler, ci)) and as output the correspond- ible with the input model and produce only sen- ing utterance in a natural language (e.g., the blue tences compatible with the output model (Oncina square to the right of the circle). In this example,

In this last example (the blue square to the right of the circle), the learner must learn to include appropriate function words. In other languages, the learner may have to learn a correct choice of words determined by gender, case or other factors. (Note that we are not yet in a position to consider corrections.)

4 Experiments

Our experiments were made for a limited domain of geometric shapes and their properties and relations. This domain is a simplification of the Miniature Language Acquisition task proposed by Feldman et al. (Feldman et al., 1990). Previous applications of OSTIA to language understanding and machine translation have also used adaptations and extensions of the Feldman task.

In our experiments, we have predicates for three different shapes (circle (ci), square (sq) and triangle (tr)), three different colors (blue (bl), green (gr) and red (re)) and three different relations (to the left of (le), to the right of (ler), and above (ab)). We consider ten different natural languages: Arabic, English, Greek, Hebrew, Hindi, Hungarian, Mandarin, Russian, Spanish and Turkish.

We created a data sequence of input-output pairs, each consisting of a predicate sequence and a natural language utterance. For example, one pair for Spanish is ((ci, re, ler, tr), el circulo rojo a la derecha del triangulo). We ran OSTIA on initial segments of the sequence of pairs, of lengths 10, 20, 30, ..., to produce a sequence of subsequential transducers. The whole data sequence was used to test the correctness of the transducers generated during the process. An error is counted whenever, given a data pair (x, y), the subsequential transducer translates x to y′ and y′ ≠ y. We say that OSTIA has converged to a correct transducer if all the transducers produced afterwards have the same number of states and edges, and 0 errors on the whole data sequence.
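The experimental loop just described can be sketched as follows. This is a hedged outline rather than the authors' harness: `train_ostia` is a placeholder for an OSTIA (or DD-OSTIA) implementation, assumed to return an object with a `translate` method and `n_states`/`n_edges` attributes.

```python
def run_convergence_experiment(data, train_ostia, step=10):
    """Train on prefixes of length 10, 20, 30, ... of `data` (a list of
    (input, output) pairs) and count errors of each transducer on the
    whole data sequence."""
    history = []
    for n in range(step, len(data) + 1, step):
        t = train_ostia(data[:n])
        errors = sum(1 for x, y in data if t.translate(x) != y)
        history.append((n, t.n_states, t.n_edges, errors))

    # Convergence: from some prefix length on, every transducer has the same
    # number of states and edges and zero errors on the whole sequence.
    for i, (n, s, e, _) in enumerate(history):
        if all(s2 == s and e2 == e and err2 == 0
               for _, s2, e2, err2 in history[i:]):
            return n  # number of pairs needed until convergence
    return None       # no convergence observed on this data sequence
```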
To generate the sequences of input-output pairs, for each language we constructed a meaning transducer capable of producing the 444 different possible meanings involving one or two objects. We randomly generated 400 unique (non-repeated) input-output pairs for each language. This process was repeated 10 times. In addition, to investigate the effect of the order of presentation of the input-output pairs, we repeated the data generation process for each language, sorting the pairs according to a length-lex ordering of the utterances.

We give some examples to illustrate the transducers produced. Figure 1 shows an example of a transducer produced by OSTIA after just ten pairs of input-output examples for Spanish. This transducer correctly translates the ten predicate sequences used to construct it, but the data is not sufficient for OSTIA to generalize correctly in all cases, and many other correct meanings are still incorrectly translated. For example, the sequence (ci, bl) is translated as el circulo a la izquierda del circulo verde azul instead of el circulo azul.

The transducers produced after convergence by OSTIA and DD-OSTIA correctly translate all 444 possible correct meanings. Examples for Spanish are shown in Figure 2 (OSTIA) and Figure 3 (DD-OSTIA). Note that although they correctly translate all 444 correct meanings, the behavior of these two transducers on other (incorrect) predicate sequences is different, for example on (tr, tr).

[Figure 1 appears here in the original: a small transducer diagram with Spanish output fragments on its edges.]
Figure 1: Production task, OSTIA. A transducer produced using 10 random unique input-output pairs (predicate sequence, utterance) for Spanish.

[Figure 2 appears here in the original: the OSTIA transducer obtained after convergence, with edges such as sq / el cuadrado and le / a la izquierda del.]
Figure 2: Production task, OSTIA. A transducer produced (after convergence) by using random unique input-output pairs (predicate sequence, utterance) for Spanish.

Different languages required very different numbers of data pairs to converge. Statistics on the number of pairs needed until convergence for OSTIA for all ten languages, for both random unique and random unique sorted data sequences, are shown in Table 1.

[Figure 3 appears here in the original: the DD-OSTIA transducer obtained after convergence for Spanish.]
Figure 3: Production task, DD-OSTIA. A transducer produced (after convergence) using random unique input-output pairs (predicate sequence, utterance) for Spanish.

Language     # Pairs   # Sorted Pairs
Arabic          150         200
English         200         235
Greek           375         400
Hebrew          195          30
Hindi           380         350
Hungarian       365         395
Mandarin         45         150
Russian         270         210
Spanish         190         150
Turkish         185          80

Table 1: Production task, OSTIA. The entries give the median number of input-output pairs until convergence in 10 runs. For Greek, Hindi and Hungarian, the median for the unsorted case is calculated using all 444 random unique pairs, instead of 400.

Language     # Pairs   # Sorted Pairs
Arabic           80         140
English          85         180
Greek           350         400
Hebrew           65          80
Hindi           175         120
Hungarian       245         140
Mandarin         40         150
Russian         185         210
Spanish          80         150
Turkish          50          40

Table 2: Production task, DD-OSTIA. The entries give the median number of input-output pairs until convergence in 10 runs. For Greek, Hindi and Hungarian, the median for the unsorted case is calculated using all 444 random unique pairs, instead of 400.

Language     # states   # edges
Arabic           2         20
English          2         20
Greek            9         65
Hebrew           2         20
Hindi            7         58
Hungarian        3         20
Mandarin         1         10
Russian          3         30
Spanish          2         20
Turkish          4         31

Table 3: Production task, OSTIA. Sizes of transducers at convergence.

Because better results were reported using DD-OSTIA in machine translation tasks (Oncina, 1998), we also tried using DD-OSTIA for learning to translate a sequence of predicates to an utterance. We used the same sequences of input-output pairs as in the previous experiment. The results obtained are shown in Table 2.

We also report the sizes of the transducers learned by OSTIA and DD-OSTIA. Table 3 and Table 4 show the numbers of states and edges of the transducers after convergence for each language. In case of disagreements, the number reported is the mode.

To answer the question of whether production is harder than comprehension in this setting, we also considered the comprehension task, that is, to translate an utterance in a natural language into the corresponding sequence of predicates. The comprehension task was studied by Oncina et al. (Castellanos et al., 1993). They used English sentences, with a more complex version of the Feldman task domain and more complex semantic representations than we use. Our results are presented in Table 5. The number of states and edges of the transducers after convergence is shown in Table 6.

5 Discussion

It should be noted that because the transducers output by OSTIA and DD-OSTIA correctly reproduce all the pairs used to construct them, once either algorithm has seen all 444 possible data pairs in either the production or the comprehension task, the resulting transducers will correctly translate all correct inputs. However, state-merging in the algorithms induces compression and generalization, and the interesting questions are how much data is required to achieve correct generalization, and how that quantity scales with the complexity of the task. These are very difficult questions to approach analytically, but empirical results can offer valuable insights.

Language     # states   # edges
Arabic           2         17
English          2         16
Greek            9         45
Hebrew           2         13
Hindi            7         40
Hungarian        3         20
Mandarin         1         10
Russian          3         23
Spanish          2         13
Turkish          3         18

Table 4: Production task, DD-OSTIA. Sizes of transducers at convergence.

Language     OSTIA   DD-OSTIA
Arabic          65       65
English         60       20
Greek          325       60
Hebrew          90       45
Hindi           60       35
Hungarian       40       45
Mandarin        60       40
Russian        280       55
Spanish         45       30
Turkish         60       35

Table 5: Comprehension task, OSTIA and DD-OSTIA. Median number (in 10 runs) of input-output pairs until convergence using a sequence of 400 random unique pairs of (utterance, predicate sequence).

Language     # states   # edges
Arabic           1         15
English          1         13
Greek            2         25
Hebrew           1         13
Hindi            1         13
Hungarian        1         14
Mandarin         1         17
Russian          1         24
Spanish          1         14
Turkish          1         13

Table 6: Comprehension task, OSTIA and DD-OSTIA. Sizes of transducers at convergence using 400 random unique input-output pairs (utterance, predicate sequence). In cases of disagreement, the number reported is the mode.

Considering the comprehension task (Tables 5 and 6), we see that OSTIA generalizes correctly from at most 15% of all 444 possible pairs except in the cases of Greek, Hebrew and Russian. DD-OSTIA improves the OSTIA results, in some cases dramatically, for all languages except Hungarian. DD-OSTIA achieves correct generalization from at most 15% of all possible pairs for all ten languages. Because the meaning function for all ten language transducers is independent of the state, in each case there is a 1-state sequential transducer that achieves correct translation of correct utterances into predicate sequences. OSTIA and DD-OSTIA converged to 1-state transducers for all languages except Greek, for which they converged to 2-state transducers. Examining one such transducer for Greek, we found that the requirement that the transducer be "onward" necessitated two states. These results are broadly compatible with the results obtained by Oncina et al. (Castellanos et al., 1993) on language understanding; the more complex tasks they consider also give evidence that this approach may scale well for the comprehension task.

Turning to the production task (Tables 1, 2, 3 and 4), we see that providing the random samples with a length-lex ordering of utterances has inconsistent effects for both OSTIA and DD-OSTIA, sometimes dramatically increasing or decreasing the number of samples required. We do not further consider the sorted samples.

Comparing the production task with the comprehension task for OSTIA, the production task generally requires substantially more random unique samples than the comprehension task for the same language. The exceptions are Mandarin (production: 45 and comprehension: 60) and Russian (production: 270 and comprehension: 280). For DD-OSTIA the results are similar, with the sole exception of Mandarin (production: 40 and comprehension: 40). For the production task DD-OSTIA requires fewer (sometimes dramatically fewer) samples to converge than OSTIA.

However, even with DD-OSTIA the number of samples is in several cases (Greek, Hindi, Hungarian and Russian) a rather large fraction (40% or more) of all 444 possible pairs. Further experimentation and analysis is required to determine how these results will scale.

Looking at the sizes of the transducers learned by OSTIA and DD-OSTIA in the production task, we see that the numbers of states agree for all languages except Turkish. (Recall from our discussion in Section 4 that there may be differences in the behavior of the transducers learned by OSTIA and DD-OSTIA at convergence.) For the production task, Mandarin gives the smallest transducer; for this fragment of the language, the translation of correct predicate sequences into utterances can be achieved with a 1-state transducer. In contrast, English and Spanish both require 2 states to handle articles correctly. For example, in the transducer in Figure 3, the predicate for a circle (ci) is translated as el circulo if it occurs as the first object (in state 1) and as circulo if it occurs as the second object (in state 2), because del has been supplied by the translation of the intervening binary relation (le, ler, or ab). Greek gives the largest transducer for the production task, with 9 states, and requires the largest number of samples for DD-OSTIA to achieve convergence, and one of the largest numbers of samples for OSTIA. Despite the evidence of the extremes of Mandarin and Greek, the relation between the size of the transducer for a language and the number of samples required to converge to it is not monotonic.

In our model, one reason that learning the production task may in general be more difficult than learning the comprehension task is that while the mapping of a word to a predicate does not depend on context, the mapping of a predicate to a word or words does (except in the case of our Mandarin transducer). As an example, in the comprehension task the Russian words triugolnik, triugolnika and triugolnikom are each mapped to the predicate tr, but the reverse mapping must be sensitive to the context of the occurrence of tr.

These results suggest that OSTIA or DD-OSTIA may be an effective method to learn to translate sequences of predicates into natural language utterances in our domain. However, some of our objectives seem incompatible with the properties of OSTIA. In particular, it is not clear how to incorporate the learner's initial knowledge of the lexicon and ability to produce "telegraphic speech" by inverting the lexicon. Also, the intermediate results of the learning process do not seem to have the properties we expect of a learner who is progressing towards mastery of production. That is, the intermediate transducers perfectly translate the predicate sequences used to construct them, but typically produce other translations that the learner (using the lexicon) would know to be incorrect. For example, the intermediate transducer from Figure 1 translates the predicate sequence (ci) as el circulo a la izquierda del circulo verde, which the learner's lexicon indicates should be translated as (ci, le, ci, gr).

6 Future work

Further experiments and analysis are required to understand how these results will scale with larger domains and languages. In this connection, it may be interesting to try the experiments of (Castellanos et al., 1993) in the reverse (production) direction. Finding a way to incorporate the learner's initial lexicon seems important. Perhaps by incorporating the learner's knowledge of the input domain (the legal sequences of predicates) and using the domain-aware version, OSTIA-D, the intermediate results in the learning process would be more compatible with our modeling objectives. Coping with errors will be necessary; perhaps an explicitly statistical framework for machine translation should be considered.

If we can find an appropriate model of how the learner's language production process might evolve, then we will be in a position to model meaning-preserving corrections. That is, the learner chooses a sequence of predicates and maps it to a (flawed) utterance. Despite its flaws, the learner's utterance is understood by the teacher (i.e., the teacher is able to map it to the sequence of predicates chosen by the learner), who responds with a correction, that is, a correct utterance for that meaning. The learner, recognizing that the teacher's utterance has the same meaning but a different form, then uses the correct utterance (as well as the meaning and the incorrect utterance) to improve the mapping of sequences of predicates to utterances.
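The exchange just described can be summarised in a short sketch. This is only one hypothetical way of organising such a model, not a formalisation from the paper: `learner` and `teacher` are placeholder objects assumed to expose produce() and comprehend() methods, and `memory` simply collects the corrected pairs that could later train a production transducer.

```python
def correction_round(meaning, learner, teacher, memory):
    """One meaning-preserving correction exchange (hypothetical protocol)."""
    flawed = learner.produce(meaning)          # e.g. "blue square right circle"
    understood = teacher.comprehend(flawed)    # teacher recovers the predicates
    if understood == meaning:
        correction = teacher.produce(meaning)  # correct form, same meaning
        if correction != flawed:
            # The learner detects "same meaning, different form" and stores
            # the corrected pair for later transducer training.
            memory.append((meaning, correction))
        return correction
    return None  # the teacher did not understand; no correction is given
```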

It is clear that in this model, corrections are not necessary to the process of learning comprehension and production; once the learner has a correct lexicon, the utterances of the teacher can be translated into sequences of predicates, and the pairs of (predicate sequence, utterance) can be used to learn (via an appropriate variant of OSTIA) a perfect production mapping. However, it seems very likely that corrections can make the process of learning a production mapping easier or faster, and finding a model in which such phenomena can be studied remains an important goal of this work.

7 Acknowledgments

The authors sincerely thank Prof. Jose Oncina for the use of his programs for OSTIA and DD-OSTIA, as well as his helpful and generous advice. The research of Leonor Becerra-Bonache was supported by a Marie Curie International Fellowship within the 6th European Community Framework Programme.

References

Dana Angluin and Leonor Becerra-Bonache. 2008. Learning Meaning Before Syntax. ICGI, 281–292.

Leonor Becerra-Bonache, Colin de la Higuera, J.C. Janodet, and Frederic Tantini. 2007. Learning Balls of Strings with Correction Queries. ECML, 18–29.

Leonor Becerra-Bonache, Adrian H. Dediu, and Cristina Tirnauca. 2006. Learning DFA from Cor- rection and Equivalence Queries. ICGI, 281–292.

Jean Berstel. 1979. Transductions and Context-Free Languages. PhD Thesis, Teubner, Stuttgart, 1979.

Antonio Castellanos, Enrique Vidal, and Jose Oncina. 1993. Language Understanding and Subsequential Transducers. ICGI, 11/1–11/10.

Antonio Castellanos, Ismael Galiano, and Enrique Vi- dal. 1994. Applications of OSTIA to machine trans- lation tasks. ICGI, 93–105.

Michelle M. Chouinard and Eve V. Clark. 2003. Adult Reformulations of Child Errors as Negative Evi- dence. Journal of Child Language, 30:637–669.

Jerome A. Feldman, George Lakoff, Andreas Stolcke, and Susan Hollback Weber. 1990. Miniature Lan- guage Acquisition: A touchstone for cognitive sci- ence. Technical Report, TR-90-009. International Computer Science Institute, Berkeley, California. April, 1990.

Efim Kinber. 2008. On Learning Regular Expres- sions and Patterns Via Membership and Correction Queries. ICGI, 125–138.

Jose Oncina. 1991. Aprendizaje de lenguajes regulares y transducciones subsecuenciales. PhD Thesis, Universitat Politecnica de Valencia, Valencia, Spain, 1998.

Jose Oncina. 1998. The data driven approach applied to the OSTIA algorithm. ICGI, 50–56.

Jose Oncina and Miguel Angel Varo. 1996. Using domain information during the learning of a subsequential transducer. ICGI, 301–312.

Roberto Pieraccini, Esther Levin, and Enrique Vidal. 1993. Learning how to understand language. EuroSpeech'93, 448–458.

GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI)

Jorge González and Francisco Casacuberta Departamento de Sistemas Informáticos y Computación Instituto Tecnológico de Informática Universidad Politécnica de Valencia {jgonzalez,fcn}@dsic.upv.es

Abstract

GREAT is a finite-state toolkit which is devoted to Machine Translation and that learns structured models from bilingual data. The training procedure is based on grammatical inference techniques to obtain stochastic transducers that model both the structure of the languages and the relationship between them. The inference of grammars from natural language causes the models to become larger when a less restrictive task is involved; even more if a bilingual modelling is being considered. GREAT has been successful in implementing the GIATI learning methodology, addressing different scalability issues in order to deal with corpora of a high volume of data. This is reported with experiments on the EuroParl corpus, which is a state-of-the-art task in Statistical Machine Translation.

1 Introduction

Over the last years, grammatical inference techniques have not been widely employed in the machine translation area. Nevertheless, it is not unknown that researchers are trying to include some structured information into their models in order to capture the grammatical regularities that there are in languages, together with their own relationship.

GIATI (Casacuberta, 2000; Casacuberta et al., 2005) is a grammatical inference methodology to infer stochastic transducers in a bilingual modelling approach for statistical machine translation. From a statistical point of view, the translation problem can be stated as follows: given a source sentence s = s1 ... sJ, the goal is to find a target sentence t̂ = t̂1 ... t̂I, among all possible target strings t, that maximises the posterior probability:

    t̂ = argmax_t Pr(t|s)    (1)

The conditional probability Pr(t|s) can be replaced by a joint probability distribution Pr(s, t), which is modelled by a stochastic transducer inferred through the GIATI methodology (Casacuberta et al., 2004; Casacuberta and Vidal, 2004):

    t̂ = argmax_t Pr(s, t)    (2)

This paper describes GREAT, a software package for bilingual modelling from parallel corpora. GREAT is a finite-state toolkit which was born to overcome the computational problems that previous implementations of GIATI (Picó, 2005) had in practice when huge amounts of data were used. Even more, GREAT is the result of a very meticulous study of GIATI models, which improves the treatment of smoothing transitions in decoding time, and which also reduces the time required to translate an input sentence, by means of an analysis that depends on the granularity of the symbols.

Experiments for a state-of-the-art, voluminous translation task, such as the EuroParl, are reported. In (González and Casacuberta, 2007), the so-called phrase-based finite-state transducers were concluded to be a better modelling option for this task than the ones that derive from a word-based approach. That is why the experiments here are exclusively related to this particular kind of GIATI-based transducers.

The structure of this work is as follows: first, section 2 is devoted to describing the training procedure, in which the finite-state GIATI-based models are defined and their corresponding grammatical inference methods are described, including the techniques to deal with tasks of a high volume of data; then, section 3 is related to the decoding process, which includes an improved smoothing behaviour and an analysis algorithm that performs according to the granularity of the bilingual symbols in the models; to continue, section 4 deals

Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 24–32, Athens, Greece, 30 March 2009. c 2009 Association for Computational Linguistics

with an exhaustive report on experiments; and finally, the conclusions are stated in the last section.

2 Finite state models

A stochastic finite-state automaton A is a tuple (Γ, Q, i, f, P), where Γ is an alphabet of symbols, Q is a finite set of states, functions i : Q → [0, 1] and f : Q → [0, 1] refer to the probability of each state to be, respectively, initial and final, and the partial function P : Q × {Γ ∪ ε} × Q → [0, 1] defines a set of transitions between pairs of states in such a way that each transition is labelled with a symbol from Γ (or the empty string ε), and is assigned a probability. Moreover, the functions i, f, and P have to respect the consistency property in order to define a distribution of probabilities on the free monoid. Consistent probability distributions can be obtained by requiring a series of local constraints which are similar to the ones for stochastic regular grammars (Vidal et al., 2005):

    • Σ_{q ∈ Q} i(q) = 1

    • ∀q ∈ Q :  Σ_{γ ∈ {Γ ∪ ε}, q′ ∈ Q} P(q, γ, q′) + f(q) = 1

A stochastic finite-state transducer is defined similarly to a stochastic finite-state automaton, with the difference that transitions between states are labelled with pairs of symbols that belong to two different (input and output) alphabets, that is, (Σ ∪ ε) × (∆ ∪ ε). Then, given some input and output strings, s and t, a stochastic finite-state transducer is able to associate them a joint probability Pr(s, t). An example of a stochastic finite-state transducer can be observed in Figure 1.

[Figure 1 appears here in the original: a small transducer over states Q0–Q4 with transitions such as c : ABC (0.8), c : C (0.2), a : CA, b : ε and c : BC.]
Figure 1: A stochastic finite-state transducer

2.1 Inference of stochastic transducers

The GIATI methodology (Casacuberta et al., 2005) has been revealed as an interesting approach to infer stochastic finite-state transducers through the modelling of languages. Rather than learning translations, GIATI first converts every pair of parallel sentences in the training corpus into a corresponding extended-symbol string in order to, straight afterwards, infer a language model from it.

More concretely, given a parallel corpus consisting of a finite sample C of string pairs: first, each training pair (x̄, ȳ) ∈ Σ⋆ × ∆⋆ is transformed into a string z̄ ∈ Γ⋆ from an extended alphabet, yielding a string corpus S; then, a stochastic finite-state automaton A is inferred from S; finally, transition labels in A are turned back into pairs of strings of source/target symbols in Σ⋆ × ∆⋆, thus converting the automaton A into a transducer T.

The first transformation is modelled by some labelling function L : Σ⋆ × ∆⋆ → Γ⋆, while the last transformation is defined by an inverse labelling function Λ(·), such that Λ(L(C)) = C. Building a corpus of extended symbols from the original bilingual corpus allows for the use of many useful algorithms for learning stochastic finite-state automata (or equivalent models) that have been proposed in the literature on grammatical inference.

2.2 Phrase-based n-gram transducers

Phrase-based n-gram transducers represent an interesting application of the GIATI methodology, where the extended symbols are actually bilingual phrase pairs, and n-gram models are employed as language models (González et al., 2008). Figure 2 shows a general scheme for the representation of n-grams through stochastic finite state automata.

[Figure 2 appears here in the original: a layered finite-state n-gram scheme, with history levels from the null history (level 0) up to level N−1, n-gram transitions moving upwards or within the top layer, and backoff transitions moving downwards.]
Figure 2: A finite-state n-gram model
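The three GIATI steps of Section 2.1, with bilingual phrase pairs as the extended symbols of Section 2.2, can be illustrated with the following sketch. It is only a toy illustration under strong assumptions, not the GREAT implementation: sentence pairs are assumed to be already segmented into aligned phrase pairs (GIATI derives these from statistical alignments), a plain bigram count table stands in for the smoothed n-gram model, and probabilities and backoff are omitted.

```python
from collections import Counter, defaultdict

def label(segmented_pair):
    """L: turn one parallel sentence, already segmented into aligned phrase
    pairs, into a string of extended symbols (represented here as tuples)."""
    return [(tuple(src), tuple(tgt)) for src, tgt in segmented_pair]

def train_bigram_counts(corpus):
    """Count bigrams over the extended-symbol strings (smoothing omitted)."""
    bigrams = defaultdict(Counter)
    for pair in corpus:
        symbols = ["<s>"] + label(pair) + ["</s>"]
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[prev][cur] += 1
    return bigrams

def inverse_label(bigrams):
    """Λ: expose every counted transition again as (source phrase, target phrase),
    which is what the edges of the final transducer carry."""
    edges = []
    for prev, nexts in bigrams.items():
        for sym, count in nexts.items():
            if sym == "</s>":
                continue
            src, tgt = sym
            edges.append((prev, src, tgt, count))
    return edges

# Toy usage: one French-English pair segmented into two phrase pairs.
corpus = [[(["On", "demande"], ["Action", "is"]), (["une", "activité"], ["required"])]]
for edge in inverse_label(train_bigram_counts(corpus)):
    print(edge)
```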

25 The states in the model refer to all the n-gram they do not belong to the original automaton histories that have been seen in the string corpus S model. As a result, they are non-final states, with in training time. Consuming transitions jump from only one incoming and one outcoming edge each. states in a determined layer to the one immediately above, increasing the history level. Once the top 2.3 Transducer pruning via n-gram events level has been reached, n-gram transitions allow GREAT implements this pruning technique, which for movements inside this layer, from state to state, is inspired by some other statistical machine trans- updating the history to the last n − 1 seen events. lation decoders that usually filter their phrase- Given that an n-gram event based translation dictionaries by means of the Γn−1Γn−2 ... Γ2Γ1Γ0 is statistically stated words in the test set sentences (Koehn et al., 2007). as Pr(Γ0|Γn−1Γn−2 ... Γ2Γ1), then it is appro- As already seen in Figure 3, any n-gram event priately represented as a finite state transition is represented as a transition between their cor- between their corresponding up-to-date histories, responding historical states. In order to be able which are associated to some states (see Figure 3). to navigate through this transition, the analy- sis must have reached the Γn−1 ... Γ3Γ2Γ1 state Γn−1Γn−2 ... Γ2Γ1Γ0 and the remaining input must fit the source ele- ments of Γ0. In other words, the full source se- quence from the n-gram event Γn−1 ... Γ3Γ2Γ1Γ0 has to be present in the test set. Otherwise, Γn−1 ... Γ3Γ2Γ1 Γn−2 ... Γ2Γ1Γ0 its corresponding transition will not be able to Γ0 be employed during the analysis of the test set Figure 3: Finite-state representation of n-grams sentences. As a result, n-gram events that are not in the test set can skip their transition gener- ation, since they will not be affected during de- Therefore, transitions are labelled with a sym- coding time, thus reducing the size of the model. bol from Γ and every extended symbol in Γ is a If there is also a backoff probability that is asso- translation pair coming from a phrase-based dic- ciated to the same n-gram event, its correspond- tionary which is inferred from the parallel corpus. ing transition generation can be skipped too, since Nevertheless, backoff transitions to lower his- its source state will never be reached, as it is the tory levels are taken for smoothing purposes. If state which represents the n-gram event. Nev- the lowest level is reached and no transition has ertheless, since trained extended-symbol n-gram been found for next word sj, then a transition to events would typically include more than n source the state is fired, thus considering sj as a words, the verification of their presence or their non-starting word for any bilingual phrase in the absence inside the test set would imply hashing all model. There is only 1 initial state, which is deno- the test-set word sequences, which is rather im- ted as , and it is placed at the 1st history level. practical. Instead, a window size is used to hash The inverse labelling function is applied over the words in the test set, then the trained n-gram the automaton transitions as in Figure 4, obtaining events are scanned on their source sequence using a single transducer (Casacuberta and Vidal, 2004). this window size to check if they might be skipped or not. 
It should be clear that the bigger the win- On demande une activité dow size is, the more n-gram rejections there will QPr = p Q’ be, therefore the transducer will be smaller. How- Action is required ever, the translation results will not be affected as these disappearing transitions are unreachable us- ing that test set. As the window size increases, the activité / resulting filtered transducer is closer to the mini- On /ε demande /ε une /ε Action is required Q Q’ mum transducer that reflects the test set sentences. Pr = p Pr = 1 Pr = 1 Pr = 1 3 Finite state decoding Figure 4: Phrase-based inverse labelling function Equation 2 expresses the MT problem in terms of Intermediate states are artificially created since a finite state model that is able to compute the ex-

26 pression Pr(s, t). Given that only the input sen- should not be taken into account. However, as far tence is known, the model has to be parsed, taking as the words in the test sentence are compatible into account all possible t that are compatible with with the corresponding transitions, and according s. The best output hypothesis ˆt would be that one to the phrase score, this (word) synchronous pars- which corresponds to a path through the transduc- ing algorithm may store these intermediate states tion model that, with the highest probability, ac- into the trellis structure, even if the full path will cepts the input sequence as part of the input lan- not be accomplished in the end. As a consequence, guage of the transducer. these entries will be using a valuable position in- Although the navigation through the model is side the trellis structure to an idle result. This will constrained by the input sentence, the search space be not only a waste of time, but also a distortion can be extremely large. As a consequence, only on the best score per stage, reducing the effective the most scored partial hypotheses are being con- power of the beam parameter during the decoding. sidered as possible candidates to become the solu- Some other better analysis options may be rejected tion. This search process is very efficiently carried because of their a-priori lower score. Therefore, out by a beam-search approach of the well known this decoding algorithm can lead the system to a Viterbi algorithm (Jelinek, 1998), whose temporal worse translation result. Alternatively, the beam asymptotic cost is Θ(J ·|Q|· M), where M is the factor can be increased in order to be large enough average number of incoming transitions per state. to store the successful paths, thus more time will be required for the decoding of any input sentence. 3.1 Parsing strategies: from words to phrases On the other hand, a phrase-based analysis stra- tegy would never include intermediate states in- The trellis structure that is commonly employed side a trellis structure. Instead, these artificial for the analysis of an input sentence through a states are tried to be parsed through until an ori- stochastic finite state transducer has a variable size ginal state is being reached, i.e. Q’ in Figure 4. that depends on the beam factor in a dynamic Word-based and phrase-based analysis are con- beam-search strategy. That way, only those nodes ceptually compared in Figure 5, by means of their scoring at a predefined threshold from the best one respective edge generation on the trellis structure. in every stage will be considered for the next stage. A word-based parsing strategy would start with On demande une activité the initial state , looking for the best transi- tions that are compatible with the first word s1. The corresponding target states are then placed WORD−BASED EDGES into the output structure, which will be used for the analysis of the second word s2. Iteratively, every state in the structure is scanned in order to get the Q Q’ PHRASE−BASED EDGES input labels that match the current analysis word si, and then to build an output structure with the Figure 5: Word-based and phrase-based analysis best scored partial paths. Finally, the states that result from the last analysis step are then rescored However, in order to be able to use a scrolled by their corresponding final probabilities. 
two-stage implementation of a Viterbi phrase- This is the standard algorithm for parsing based analysis, the target states, which may be a source sentence through an non-deterministic positioned at several stages of distance from the stochastic finite state transducer. Nevertheless, it current one, are directly advanced to the next one. may not be the most appropriate one when dealing Therefore, the nodes in the trellis must be stored with this type of phrase-based n-gram transducers. together with their corresponding last input posi- As it must be observed in Figure 4, a set of tion that was parsed. In the same manner, states consecutive transitions represent only one phrase in the structure are only scanned if their posi- translation probability after a given history. In tion indicator is lower than the current analysis fact, the path from Q to Q’ should only be fol- word. Otherwise, they have already taken it into lowed if the remaining input sentence, which has account so they are directly transfered to the next not been analysed yet, begins with the full input stage. The algorithm remains synchronous with sequence On demande une activité. Otherwise, it the words in the input sentence, however, on this

27 particular occasion, states in the i-th step of anal- such a way as if they could be internally repre- ysis are guaranteed to have parsed at least until senting any possible bilingual symbol from the ex- the i-th word, but maybe they have gone further. tended vocabulary that matches their source sides. Figure 6 is a graphical diagram about this concept. That way, bilingual symbols are considered to be a sort of input, so the backoff smoothing criterion is i j then applied to each compatible, bilingual symbol. On demande une activité For phrase-based transducers, it means that for a successful transition (¯x, y¯), there is no need to go ′ ′ ′ Qj Qj Qj backoff and find other paths consuming that bilin- gual symbol, but we must try backoff transitions to look for any other successful transition (¯x′, y¯′),

′ which is also compatible with the current position. Qi Qj Conceptually, this procedure would be as if the input sentence, rather than a source string, was ac- Figure 6: A phrase-based analysis implementation tually composed of a left-to-right bilingual graph, Moreover, all the states that are being stored in being built from the expansion of every input word the successive stages, that is, the ones from the ori- into their compatible, bilingual symbols as in a ginal topology of the finite-state representation of category-based approach. Phrase-based bilingual symbols would be graphically represented as a sort the n-gram model, are also guaranteed to lead to a final state in the model, because if they are not of skip transitions inside this bilingual input graph. final states themselves, then there will always be a This new interpretation about the backoff successful path towards a final state. smoothing weights on bilingual n-gram models, which is not a priori a trivial feature to be included, GREAT incorporates an analysis strategy that is easily implemented for stochastic transducers depends on the granularity of the bilingual sym- by considering backoff transitions as transi- bols themselves so that a phrase-based decoding ε/ε tions and keeping track of a dynamic list of forbid- is applied when a phrase-based transducer is used. den states every time a backoff transition is taken. 3.2 Backoff smoothing An outline about the management of state ac- tiveness, which is integrated into the parsing algo- Two smoothing criteria have been explored in or- rithm, is shown below: der to parse the input through the GIATI model. First, a standard backoff behaviour, where back- ALGORITHM off transitions are taken as failure transitions, was implemented. There, backoff transitions are only for Q in {states to explore} followed if there is not any other successful path for Q-Q' in {transitions} (a) that has been compatible with the remaining input. if Q' is active However, GREAT also includes another more [...] refined smoothing behaviour, to be applied over set Q' to inactive the same bilingual n-gram transducers, where if Q is not NULL smoothing edges are interpreted in a different way. if Q not in the top level GREAT suggests to apply the backoff crite- for Q' in {inactive states} rion according to its definition in the grammati- set Q' to active cal inference method which incorporated it into Q'' := backoff(Q') the language model being learnt and that will be set Q'' to inactive represented as a stochastic finite-state automaton. Q := backoff(Q) GoTo (a) In other words, from the n-gram point of view, backoff weights (or finite-state transitions) should else only be employed if no transitions are found in the [...] for Q' in {inactive states} n-gram automaton for a current bilingual symbol. Nevertheless, input words in translation applica- set Q' to active tions do not belong to those bilingual languages. [...] Instead, input sequences have to be analysed in END ALGORITHM
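A much-simplified sketch of the same backoff criterion is given below, ignoring beam search and transition scores. Instead of keeping explicit lists of forbidden states as in the outline above, the same effect is obtained here by skipping, at each backoff level, any bilingual symbol already matched at a higher history level; the state names and data layout are hypothetical.

```python
def expand(state, transitions, backoff_of):
    """Collect, for every compatible bilingual symbol, the single usable
    transition from `state`: the one reached through the minimum number of
    backoff steps. `transitions[q]` maps bilingual symbols to (target, score);
    `backoff_of[q]` gives the backoff state of q (None at the lowest level)."""
    found = {}
    q = state
    while q is not None:
        for symbol, (target, score) in transitions.get(q, {}).items():
            # A symbol already found at a higher history level is "forbidden"
            # at this lower level, so it is simply skipped here.
            if symbol not in found:
                found[symbol] = (target, score)
        q = backoff_of.get(q)
    return found

# Toy usage with hypothetical states and phrase symbols.
transitions = {"q_hi": {("une activité", "activity"): ("q2", 0.4)},
               "q_lo": {("une activité", "activity"): ("q3", 0.1),
                        ("une", "a"): ("q4", 0.2)}}
backoff_of = {"q_hi": "q_lo", "q_lo": None}
print(expand("q_hi", transitions, backoff_of))
```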

The algorithm will try to translate several consecutive input words as a whole phrase, always allowing a backoff transition in order to cover all the compatible phrases in the model, not only the ones which have been seen after a given history, but after all its suffixes as well. A dynamic list of forbidden states will take care to accomplish an exploration constraint that has to be included into the parsing algorithm: a path between two states Q and Q' has necessarily to be traced through the minimum number of backoff transitions; any other Q-Q' or Q-Q'' paths, where Q'' is the destination of a Q'-Q'' backoff path, should be ignored. This constraint ensures that only one transition per bilingual symbol is followed, and that it is the highest in the hierarchy of history levels. Figure 7 shows a parsing example over a finite-state representation of a smoothed bigram model.

Figure 7: Compatible edges for a bigram model.

Given a reaching state Q, let us assume that the transitions that correspond to certain bilingual phrase pairs p1, p2 and p3 are all compatible with the remaining input. However, the bigram (Q, p3) did not occur throughout the training corpus, therefore there is no direct transition from Q to p3. A backoff transition enables the access to p3 because the bigram (Q, p3) turns into a unigram event that is actually inside the model. Unigram transitions to p1 and p2 must be ignored because their corresponding bigram events were successfully found one level above.

4 Experiments

GREAT has been successfully employed to work with the French-English EuroParl corpus, that is, the benchmark corpus of the NAACL 2006 shared task of the Workshop on Machine Translation of the Association for Computational Linguistics. The corpus characteristics can be seen in Table 1.

Table 1: Characteristics of the Fr-En EuroParl.

                               French     English
  Training    Sentences              688031
              Run. words       15.6 M     13.8 M
              Vocabulary        80348      61626
  Dev-Test    Sentences                2000
              Run. words        66200      57951

The EuroParl corpus is built on the proceedings of the European Parliament, which are published on its web and are freely available. Because of its nature, this corpus has a large variability and complexity, since the translations into the different official languages are performed by groups of human translators. The fact that not all translators agree in their translation criteria implies that a given source sentence can be translated in various different ways throughout the corpus. Since the proceedings are not available in every language as a whole, a different subset of the corpus is extracted for every different language pair, thus evolving into a somewhat different corpus for each pair of languages.

4.1 System evaluation

We evaluated the performance of our methods by using the following evaluation measures:

BLEU (Bilingual Evaluation Understudy) score: This indicator computes the precision of unigrams, bigrams, trigrams, and tetragrams with respect to a set of reference translations, with a penalty for too short sentences (Papineni et al., 2001). BLEU measures accuracy, not error rate.

WER (Word Error Rate): The WER criterion calculates the minimum number of editions (substitutions, insertions or deletions) that are needed to convert the system hypothesis into the sentence considered ground truth. Because of its nature, this measure is very pessimistic.

Time: The average time (in milliseconds) to translate one word from the test corpus, without considering loading times.
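The WER criterion above is a plain word-level edit distance; the following self-contained sketch (ours, not the paper's evaluation code) shows the computation it refers to.

```python
# Illustrative word error rate: minimum substitutions, insertions and
# deletions needed to turn the hypothesis into the reference, divided by
# the reference length.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("we ask for an activity", "we ask an activity"))  # 0.2
```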

4.2 Results

A set of experimental results were obtained in order to assess the impact of the proposed techniques in the work with phrase-based n-gram transducers.

By assuming an unconstrained parsing, that is, the successive trellis structure is large enough to store all the states that are compatible within the analysis of a source sentence, the results are not very sensitive to the n-gram degree, just showing that bigrams are powerful enough for this corpus. However, apart from this, Table 2 also shows a significantly better performance for the second, more refined behaviour for the backoff transitions.

Table 2: Results for the two smoothing criteria.

  Backoff           n=1    n=2    n=3    n=4    n=5
  baseline  BLEU   26.8   26.3   25.8   25.7   25.7
            WER    62.3   63.9   64.5   64.5   64.5
  GREAT     BLEU   26.8   28.0   27.9   27.9   27.9
            WER    62.3   61.9   62.0   62.0   62.0

From now on, the algorithms will be tested on the phrase-based bigram transducer, built according to the GIATI method, where backoff is employed as ε/ε transitions with forbidden states. In these conditions, the results, following a word-based and a phrase-based decoding strategy, as a function of the dynamic beam factor, can be analysed in Tables 3 and 4.

Table 3: Results for a word-based analysis.

  beam   Time (ms)   BLEU    WER
  1.00        0.1     0.4   94.6
  1.02        0.3    12.8   81.9
  1.05        5.2    20.0   74.0
  1.10       30.0    24.9   68.2
  1.25       99.0    27.1   64.6
  1.50      147.0    27.5   62.9
  2.00      173.6    27.8   62.1
  3.50      252.3    28.0   61.9

Table 4: Results for a phrase-based analysis.

  beam   Time (ms)   BLEU    WER
  1.00        0.2    19.8   71.8
  1.02        0.4    22.1   68.6
  1.05        0.7    24.3   66.0
  1.10        2.4    26.1   64.2
  1.25        7.0    27.1   62.8
  1.50        9.7    27.5   62.3
  2.00       11.4    27.8   62.0
  3.50       12.3    28.0   61.9

From the comparison of Tables 3 and 4, it can be deduced that a word-based analysis is iteratively taking into account a quite high percentage of useless states, thus needing to increase the beam parameter to include the successful paths into the analysis. The price for considering such a long list of states in every iteration of the algorithm is paid in terms of temporal requirements. However, a phrase-based approach only stores those states that have been successfully reached by a full phrase compatibility with the input sentence. Therefore, it takes more time to process an individual state, but since the list of states is shorter, the search method performs at a better speed rate.

Another important element to point out between Tables 3 and 4 is the difference in quality results for the same beam parameter in both tables. Word-based decoding strategies suffer the effective reduction of the beam factor that was mentioned in Section 3.1, because their best scores at every analysis stage, which determine the exploration boundaries, may refer to a dead-end path. Logically, these differences are progressively reduced as the beam parameter increases, since the search space is explored in a more exhaustive way.

On the other hand, a phrase-based extended-symbol bigram model, learnt by means of the full training data, computes an overall set of approximately 6 million events. The application of the n-gram pruning technique, using a growing window parameter, can effectively reduce that number to only 200,000. These n-grams, when represented as transducer transitions, suppose a reduction from 20 million transitions to only those 500,000 that are affected by the test set sentences. As a result, the size of the model to be parsed decreases, and therefore the decoding time also decreases. Tables 5 and 6 show the effect of this pruning method on the size of the transducers, and then on the decoding time with a phrase-based analysis, which is the best strategy for phrase-based models.

Table 5: Number of trained and survived n-grams.

  Window size    unigrams     bigrams
  No filter     1,593,677   4,477,382
  2               299,002     512,943
  3               153,153     141,883
  4               130,666      90,265
  5               126,056      78,824
  6               125,516      77,341

Table 6: Decoding time for several window sizes.

  Window size        Edges   Time (ms)
  No filter     19,333,520       362.4
  2              2,752,882        41.3
  3                911,054        17.3
  4                612,006        12.8
  5                541,059        11.9
  6                531,333        11.8

Needless to say, BLEU and WER keep their best numbers for all the transducer sizes, as the test set is not present in the pruned transitions.

5 Conclusions

GIATI is a grammatical inference technique to learn stochastic transducers from bilingual data for their usage in statistical machine translation. Finite-state transducers are able to model both the structure of the two languages and their relationship. GREAT is a finite-state toolkit which was born to overcome the computational problems that previous implementations of GIATI present in practice when huge amounts of parallel data are employed.

Moreover, GREAT is the result of a very meticulous study of GIATI models, which improves the treatment of smoothing transitions in decoding time, and also reduces the time required to translate an input sentence by means of an analysis that depends on the granularity of the symbols.

A pruning technique has been designed for n-gram approaches, which reduces the transducer size to integrate only those transitions that are really required for the translation of the test set. That has allowed us to perform some experiments concerning a state-of-the-art, voluminous translation task, such as the EuroParl, whose results have been reported in depth. A better performance has been found when a phrase-based decoding strategy is selected in order to deal with those GIATI phrase-based transducers. Besides, this permits us to apply a more refined interpretation of backoff transitions for a better smoothing translation behaviour.

Acknowledgments

This work was supported by the "Vicerrectorado de Innovación y Desarrollo de la Universidad Politécnica de Valencia", under grant 20080033.

References

Francisco Casacuberta and Enrique Vidal. 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics, 30(2):205–225.

Francisco Casacuberta, Hermann Ney, Franz Josef Och, Enrique Vidal, Juan Miguel Vilar, Sergio Barrachina, Ismael García-Varea, David Llorens, César Martínez, and Sirko Molau. 2004. Some approaches to statistical and finite-state speech-to-speech translation. Computer Speech & Language, 18(1):25–47.

Francisco Casacuberta, Enrique Vidal, and David Picó. 2005. Inference of finite-state transducers from regular languages. Pattern Recognition, 38(9):1431–1443.

F. Casacuberta. 2000. Inference of finite-state transducers by using regular grammars and morphisms. In A.L. Oliveira, editor, Grammatical Inference: Algorithms and Applications, volume 1891 of Lecture Notes in Computer Science, pages 1–14. Springer-Verlag. 5th International Colloquium on Grammatical Inference (ICGI 2000), Lisboa, Portugal.

J. González and F. Casacuberta. 2007. Phrase-based finite state models. In Proceedings of the 6th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP), Potsdam (Germany), September 14-16.

J. González, G. Sanchis, and F. Casacuberta. 2008. Learning finite state transducers using bilingual phrases. In 9th International Conference on Intelligent Text Processing and Computational Linguistics. Lecture Notes in Computer Science, Haifa, Israel, February 17 to 23.

Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. The MIT Press, January.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. The Association for Computer Linguistics.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. Bleu: a method for automatic evaluation of machine translation.

David Picó. 2005. Combining Statistical and Finite-State Methods for Machine Translation. Ph.D. thesis, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia (Spain), September. Advisor: Dr. F. Casacuberta.

E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. Carrasco. 2005. Probabilistic finite-state machines - part I. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013–1025.

A note on contextual binary feature grammars

Alexander Clark
Department of Computer Science
Royal Holloway, University of London
[email protected]

Rémi Eyraud and Amaury Habrard
Laboratoire d'Informatique Fondamentale de Marseille, CNRS
Aix-Marseille Université, France
remi.eyraud,[email protected]

Abstract

Contextual Binary Feature Grammars were recently proposed by (Clark et al., 2008) as a learnable representation for richly structured context-free and context-sensitive languages. In this paper we examine the representational power of the formalism, its relationship to other standard formalisms and language classes, and its appropriateness for modelling natural language.

1 Introduction

An important issue that concerns both natural language processing and machine learning is the ability to learn suitable structures of a language from a finite sample. There are two major points that have to be taken into account in order to define a learning method useful for the two fields: first, the method should rely on intrinsic properties of the language itself, rather than syntactic properties of the representation. Secondly, it must be possible to associate some semantics to the structural elements in a natural way.

Grammatical inference is clearly an important technology for NLP as it will provide a foundation for theoretically well-founded unsupervised learning of syntax, and thus avoid the annotation bottleneck and the limitations of working with small hand-labelled treebanks.

Recent advances in context-free grammatical inference have established that there are large learnable classes of context-free languages. In this paper, we focus on the basic representation used by the recent approach proposed in (Clark et al., 2008). The authors consider a formalism called Contextual Binary Feature Grammars (CBFG) which defines a class of grammars using contexts as features instead of classical non-terminals. The use of features is interesting from an NLP point of view because we can associate some semantics to them, and because we can represent complex, structured syntactic categories. The notion of contexts is relevant from a grammatical inference standpoint since they are easily observable from a finite sample. In this paper we establish some basic language theoretic results about the class of exact Contextual Binary Feature Grammars (defined in Section 3), in particular their relationship to the Chomsky hierarchy: exact CBFGs are those where the contextual features are associated to all the possible strings that can appear in the corresponding contexts of the language defined by the grammar.

The main results of this paper are proofs that the class of exact CBFGs:
• properly includes the regular languages (Section 5),
• does not include some context-free languages (Section 6),
• and does include some non context-free languages (Section 7).

Thus, this class of exact CBFGs is orthogonal to the classic Chomsky hierarchy but can represent a very large class of languages. Moreover, it has been shown that this class is efficiently learnable. This class is therefore an interesting candidate for modeling natural language and deserves further investigation.

2 Basic Notation

We consider a finite alphabet Σ, and Σ∗ the free monoid generated by Σ. λ is the empty string, and a language is a subset of Σ∗. We will write the concatenation of u and v as uv, and similarly for sets of strings. u ∈ Σ∗ is a substring of v ∈ Σ∗ if there are strings l, r ∈ Σ∗ such that v = lur.
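As a concrete illustration of the substring and context notation just introduced (and of why contexts are easy to observe from a finite sample), here is a small Python sketch; the function name and the toy sample are ours, purely for illustration.

```python
# Toy illustration: collecting the observable contexts of a string u in a
# finite sample of a language L (contexts are pairs (l, r) with l+u+r in L).

def contexts(u, sample):
    result = set()
    for w in sample:
        for i in range(len(w) - len(u) + 1):
            if w[i:i + len(u)] == u:                 # u occurs as a substring of w
                result.add((w[:i], w[i + len(u):]))  # (l, r) such that w = l u r
    return result

sample = {"ab", "aabb", "aaabbb"}    # finite sample of {a^n b^n}
print(contexts("ab", sample))        # {('', ''), ('a', 'b'), ('aa', 'bb')}
```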


A context is an element of Σ∗ × Σ∗. For a string u and a context f = (l, r) we write f ⊙ u = lur; the insertion or wrapping operation. We extend this to sets of strings and contexts in the natural way. A context is also known in structuralist linguistics as an environment.

The set of contexts, or distribution, of a string u of a language L is CL(u) = {(l, r) ∈ Σ∗ × Σ∗ | lur ∈ L}. We will often drop the subscript where there is no ambiguity. We define the syntactic congruence as u ≡L v iff CL(u) = CL(v). The equivalence classes under this relation are the congruence classes of the language. In general we will assume that λ is not a member of any language.

3 Contextual Binary Feature Grammars

Most definitions and lemmas of this section were first introduced in (Clark et al., 2008).

3.1 Definition

Before the presentation of the formalism, we give some results about contexts to help to give an intuition of the representation. The basic insight behind CBFGs is that there is a relation between the contexts of a string w and the contexts of its substrings. This is given by the following trivial lemma:

Lemma 1. For any language L and for any strings u, u′, v, v′ if C(u) = C(u′) and C(v) = C(v′), then C(uv) = C(u′v′).

We can also consider a slightly stronger result:

Lemma 2. For any language L and for any strings u, u′, v, v′ if C(u) ⊆ C(u′) and C(v) ⊆ C(v′), then C(uv) ⊆ C(u′v′).

C(u) ⊆ C(u′) means that we can replace any occurrence of u in a sentence with a u′, without affecting the grammaticality, but not necessarily vice versa. Note that none of these strings need to correspond to non-terminals: this is valid for any fragment of a sentence. We will give a simplified example from English syntax: the pronoun it can occur everywhere that the pronoun him can, but not vice versa¹. Thus given a sentence "I gave him away", we can substitute it for him, to get the grammatical sentence I gave it away, but we cannot reverse the process. For example, given the sentence it is raining, we cannot substitute him for it, as we will get the ungrammatical sentence him is raining. Thus we observe C(him) ⊊ C(it).

Looking at Lemma 2 we can also say that, if we have some finite set of strings K, where we know the contexts, then:

Corollary 1.
  C(w) ⊇ ⋃_{u′,v′: u′v′=w}  ⋃_{u∈K: C(u)⊆C(u′)}  ⋃_{v∈K: C(v)⊆C(v′)}  C(uv)

This is the basis of the representation: a word w is characterised by its set of contexts. We can compute the representation of w from the representation of its parts u′, v′, by looking at all of the other matching strings u and v where we understand how they combine (with subset inclusion). In order to illustrate this concept, we give here a simple example.

Consider the language {aⁿbⁿ | n > 0} and the set K = {aabb, ab, abb, aab, a, b}. Suppose we want to compute the set of contexts of aaabbb. Since C(abb) ⊆ C(aabbb), and vacuously C(a) ⊆ C(a), we know that C(aabb) ⊆ C(aaabbb). More generally, the contexts of ab can represent aⁿbⁿ, those of aab the strings aⁿ⁺¹bⁿ and the ones of abb the strings aⁿbⁿ⁺¹.

The key relationships are given by context set inclusion. Contextual binary feature grammars allow a proper definition of the combination of context inclusion:

Definition 1. A Contextual Binary Feature Grammar (CBFG) G is a tuple ⟨F, P, PL, Σ⟩. F is a finite set of contexts, called features, where we write C = 2^F for the power set of F defining the categories of the grammar, P ⊆ C × C × C is a finite set of productions that we write x → yz where x, y, z ∈ C, and PL ⊆ C × Σ is a set of lexical rules, written x → a. Normally PL contains exactly one production for each letter in the alphabet (the lexicon).

A CBFG G defines recursively a map fG

¹This example does not account for a number of syntactic and semantic phenomena, particularly the distribution of reflexive anaphors.

from Σ∗ → C as follows:

  fG(λ) = ∅                                                        (1)

  fG(w) = ⋃_{(c→w)∈PL} c                         iff |w| = 1       (2)

  fG(w) = ⋃_{u,v: uv=w} ⋃_{x→yz∈P: y⊆fG(u) ∧ z⊆fG(v)} x            iff |w| > 1   (3)

We give here more explanation about the map fG. It defines in fact the analysis of a string by a CBFG. A rule z → xy is applied to analyse a string w if there is a cut uv = w s.t. x ⊆ fG(u) and y ⊆ fG(v); recall that x and y are sets of contexts. Intuitively, the relation given by the production rule is linked with Lemma 2: z is included in the set of features of w = uv. From this relationship, for any (l, r) ∈ z we have lwr ∈ L(G).

The complete computation of fG is then justified by Corollary 1: fG(w) defines all the possible features associated by G to w with all the possible cuts uv = w (i.e. all the possible derivations).

Finally, the natural way to define the membership of a string w in L(G) is to have the context (λ, λ) ∈ fG(w), which implies that λwλ = w ∈ L(G).

Definition 2. The language defined by a CBFG G is the set of all strings that are assigned the empty context: L(G) = {u | (λ, λ) ∈ fG(u)}.

As we saw before, we are interested in cases where there is a correspondence between the language theoretic interpretation of a context, and the occurrence of that context as a feature in the grammar. From the basic definition of a CBFG, we do not require any specific condition on the features of the grammar, except that a feature is associated to a string if the string appears in the context defined by the feature. However, we can also require that fG defines exactly all the possible features that can be associated to a given string according to the underlying language.

Definition 3. Given a finite set of contexts F = {(l1, r1), ..., (ln, rn)} and a language L we can define the context feature map FL : Σ∗ → 2^F which is just the map u ↦ {(l, r) ∈ F | lur ∈ L} = CL(u) ∩ F.

Using this definition, we now need a correspondence between the language theoretic context feature map FL and the representation in the CBFG fG.

Definition 4. A CBFG G is exact if for all u ∈ Σ∗, fG(u) = FL(G)(u).

Exact CBFGs are a more limited formalism than CBFGs themselves; without any limits on the interpretation of the features, we can define a class of formalisms that is equal to the class of Conjunctive Grammars (see Section 4). However, exactness is an important notion because it allows to associate intrinsic components of a language to strings. Contexts are easily observable from a sample and, moreover, it is only when the features correspond to the contexts that distributional learning algorithms can infer the structure of the language. A basic example of such a learning algorithm is given in (Clark et al., 2008).

3.2 A Parsing Example

To clarify the relationship with CFG parsing, we will give a simple worked example. Consider the CBFG G = ⟨{(λ, λ), (aab, λ), (λ, b), (λ, abb), (a, λ)}, P, PL, {a, b}⟩ with
  PL = {{(λ, b), (λ, abb)} → a, {(a, λ), (aab, λ)} → b}
and P =
  {{(λ, λ)} → {(λ, b)}{(aab, λ)},
   {(λ, λ)} → {(λ, abb)}{(a, λ)},
   {(λ, b)} → {(λ, abb)}{(λ, λ)},
   {(a, λ)} → {(λ, λ)}{(aab, λ)}}.

If we want to parse the string w = aabb the usual way is to have a bottom-up approach. This means that we recursively compute the fG map on the substrings of w in order to check whether (λ, λ) belongs to fG(w).

Figure 1 graphically gives the main steps of the computation of fG(aabb). Basically there are two ways to split aabb that allow the derivation of the empty context: aab|b and a|abb. The first one corresponds to the top part of the figure while the second one is drawn at the bottom. We can see for instance that the empty context belongs to fG(ab) thanks to the rule {(λ, λ)} → {(λ, abb)}{(a, λ)}: {(λ, abb)} ⊆ fG(a) and {(a, λ)} ⊆ fG(b).
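Before returning to the example, the following sketch (our own illustration, using frozensets of (left, right) context pairs and the empty string for λ) shows how this bottom-up computation of fG can be carried out for the grammar above.

```python
# Illustrative bottom-up computation of f_G for the CBFG above, following
# equations (1)-(3). The encoding is ours, purely for illustration.

from functools import lru_cache

EMPTY = ("", "")                                  # the empty context (λ, λ)
LEX = {                                           # lexical rules P_L
    "a": frozenset({("", "b"), ("", "abb")}),
    "b": frozenset({("a", ""), ("aab", "")}),
}
PRODS = [                                         # productions x -> y z
    (frozenset({EMPTY}),     frozenset({("", "b")}),   frozenset({("aab", "")})),
    (frozenset({EMPTY}),     frozenset({("", "abb")}), frozenset({("a", "")})),
    (frozenset({("", "b")}), frozenset({("", "abb")}), frozenset({EMPTY})),
    (frozenset({("a", "")}), frozenset({EMPTY}),       frozenset({("aab", "")})),
]

@lru_cache(maxsize=None)
def f_G(w):
    if len(w) == 0:
        return frozenset()                        # equation (1)
    if len(w) == 1:
        return LEX.get(w, frozenset())            # equation (2)
    feats = set()
    for i in range(1, len(w)):                    # equation (3): all cuts uv = w
        fu, fv = f_G(w[:i]), f_G(w[i:])
        for x, y, z in PRODS:
            if y <= fu and z <= fv:               # y ⊆ f_G(u) and z ⊆ f_G(v)
                feats |= x
    return frozenset(feats)

print(EMPTY in f_G("aabb"))   # True: aabb belongs to L(G)
print(EMPTY in f_G("aab"))    # False
```

Running this confirms that (λ, λ) ∈ fG(aabb) while (λ, λ) ∉ fG(aab), i.e. only aabb is accepted.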
But for symmetrical reasons

35 the result can also be obtained using the rule teresting to note that several formalisms in- {(λ, λ)} → {(λ, b)}{(aab, λ)}. troduced with the aim of representing natural As we trivially have fG(aa) = fG(bb) = ∅, languages share strong links with CBFG. since no right-hand side contains the concate- nation of the same two features, an induction Range Concatenation Grammars proof can be written to show that (λ, λ) ∈ Range Concatenation Grammars are a very n n fG(w) ⇔ w ∈ {a b : n > 0}. powerful formalism (Boullier, 2000), that is a current area of research in NLP.

f (aabb) {(λ,λ)} G ⊇ Lemma 3. For every CBFG G, there is

Rule: (λ,λ) → (λ,b) (aab,λ) a non-erasing positive range concatenation grammar of arity one, in 2-var form that de- f (aab) ⊇ {(λ,b)} G fines the same language. Rule: (λ,b) → (λ,abb) (λ,λ) Proof. Suppose G = hF,P,PL, Σi. Define f (ab) ⊇ {(λ,λ)} G a RCG with a set of predicates equal to F Rule: (λ,λ) → (λ,abb) (a,λ) and the following clauses, and the two vari- {(λ,b),(λ,abb)} {(λ,b),(λ,abb)} {(a,λ),(aab,λ)} {(a,λ),(aab,λ)} ables U, V . For each production x → yz in

G G G

G

f f f f P , for each f ∈ x, where y = {g1, . . . gi}, a a b b z = {h1, . . . hj} add clauses f f f f G G G G f(UV ) → g1(U), . . . gi(U), h1(V ), . . . hj(V ). For each lexical production {f1 . . . fk} → a {(λ,b),(λ,abb)} {(λ,b),(λ,abb)} {(a,λ),(aab,λ)} {(a,λ),(aab,λ)} add clauses fi(a) → . It is straightforward Rule: (λ,λ) → (λ,b) (aab,λ) to verify that f(w) `  iff f ∈ fG(w).

f (ab) ⊇ {(λ,λ)} G Conjunctive Grammar Rule: (a,λ) → (λ,λ) (aab,λ) A more exact correspondence is to the class of Conjunctive Grammars (Okhotin, 2001), in- f (abb) ⊇ {(a,λ)} G vented independently of RCGs. For every ev- Rule: (λ,λ) → (λ,abb) (a,λ) ery language L generated by a conjunctive grammar there is a CBFG representing L# f (aabb) ⊇ {(λ,λ)} G (where the special character # is not included in the original alphabet). Figure 1: The two derivations to obtain (λ, λ) Suppose we have a conjunctive grammar in fG(aabb) in the grammar G. G = hΣ,N,P,Si in binary normal form (as defined in (Okhotin, 2003)). We construct the equivalent CBFG G0 = hF,P 0,P , Σi as fol- This is a simple example that illustrates L lowed: the parsing of a string given a CBFG. This example does not characterize the power of • For every letter a we add a context (la, ra) CBFG since no right handside part is com- to F such that laara ∈ L; posed of more than one context. A more inter- esting, example with a context-sensitive lan- • For every rules X → a in P , we create a guage, will be presented in Section 7. rule {(la, ra)} → a in PL.

4 Non exact CBFGs • For every non terminal X ∈ N, for every rule X → P1Q1& ... &PnQn we add dis- The aim here is to study the expressive power tinct contexts {(lPiQi , rPiQi )} to F, such of CBFG compare to other formalism recently that for all i it exists ui, lPiQi uirPiQi ∈ L introduced. Though the inference can be done ∗ and PiQi ⇒G ui; only for exact CBFG, where features are di- rectly linked with observable contexts, it is • Let FX,j = {(lPiQi , rPiQi ): ∀i} the still worth having a look at the more general set of contexts corresponding to the characteristics of CBFG. For instance, it is in- jth rule applicable to X. For all

36 0 (lPiQi , rPiQi ) ∈ FX,j, we add to P the

rules (lPiQi , rPiQi ) → FPi,kFQi,l (∀k, l). • We add a new context (w, λ) to F such ∗ that S ⇒G w and (w, λ) → # to PL; • For all j, we add to P 0 the rule (λ, λ) → FS,j{(w, λ)}. It can be shown that this construction gives Figure 2: Example of a DFA. The left residuals an equivalent CBFG. are defined by λ−1L, a−1L, b−1L and the right ones by Lλ−1, Lb−1, Lab−1 (note here that 5 Regular Languages La−1 = Lλ−1). Any can be defined by an ex- act CBFG. In order to show this we will pro- F (u). Let K be finite set of strings formed pose an approach defining a canonical form for L L by taking the lexicographically shortest string representing any regular language. from each congruence class. The final gram- Suppose we have a regular language L, we mar can be obtained by combining elements consider the left and right residual languages: of KL. For every pair of strings u, v ∈ KL, we u−1L = {w|uw ∈ L} (4) define a rule

−1 Lu = {w|wu ∈ L} (5) FL(uv) → FL(u),FL(v) (7) They define two congruencies: if l, l0 ∈ u−1L and we add lexical productions of the form (resp. r, r0 ∈ Lu−1) then for all w ∈ Σ∗, lw ∈ F (a) → a, a ∈ Σ. L iff l0w ∈ L (resp. wr ∈ L iff wr0 ∈ L). L ∗ For any u ∈ Σ , let lmin(u) be the lexico- −1 Lemma 4. For all w ∈ Σ∗, f (w) = F (w). graphically shortest element such that lminL = G L u−1L. The number of such l is finite by min Proof. (Sketch) Proof in two steps: ∀w ∈ the Myhil-Nerode theorem, we denote by L min Σ∗,F (w) ⊆ f (w) and f (w) ⊆ F (w). Each this set, i.e. {l (u)|u ∈ Σ∗}. We de- L G G L min step is made by induction on the length of w fine symmetrically R for the right residuals min and uses the rules created to build the gram- (Lr−1 = Lu−1). min mar, the derivation process of a CBFG and We define the set of contexts as: the fiduciality for the second step. The key F (L) = Lmin × Rmin. (6) point rely on the fact that when a string w is parsed by a CBFG G, there exists a cut of w F (L) is clearly finite by construction. in uv = w (u, v ∈ Σ∗) and a rule z → xy in G If we consider the regular language de- such that x ⊆ fG(u) and y ⊆ fG(v). The rule fined by the deterministic finite automata z → xy is also obtained from a substring from of Figure 2, we obtain Lmin = {λ, a, b} the set used to build the grammar using the and Rmin = {λ, b, ab} and thus F (L) = FL map. By inductive hypothesis you obtain {(λ, λ), (a, λ), (b, λ), (λ, b), (a, b), (b, b), (λ, ab), inclusion between fG and FL on u and v. (a, ab), (b, ab)}. By considering this set of features, we For the language of Figure 2, the following can prove (using arguments about congruence set is sufficient to build an exact CBGF: classes) that for any strings u, v such that {a, b, aa, ab, ba, aab, bb, bba} (this corresponds FL(u) ⊃ FL(v), then CL(u) ⊃ CL(v). This to all the substrings of aab and bba). We have: means the set of feature F is sufficient to rep- FL(a) = F (L)\{(λ, λ), (a, λ)} → a resent context inclusion, we call this property FL(b) = F (L) → b the fiduciality. FL(aa) = FL(a) → FL(a),FL(a)

Note that the number of congruence classes FL(ab) = F (L) → FL(a),FL(b) = FL(a),F (L) of a regular language is finite. Each congru- FL(ba) = F (L) → FL(b),FL(a) = F (L),FL(a) ence class is represented by a set of contexts FL(bb) = F (L) → FL(b),FL(b) = F (L),F (L)

37 FL(aab) = FL(bba) = FL(ab) = FL(ba) may not correspond to the intrinsic property of the underlying language. A context could The approach presented here gives a canon- be associated with many non-terminals and we ical form for representing a regular language choose only one. For example, the context by an exact CBFG. Moreover, this is is com- (He is, λ) allows both noun phrases and ad- plete in the sense that every context of every jective phrases. In formal terms, the resulting substring will be represented by some element CBFG is not exact. Then, with the bijection of F (L): this CBFG will completely model the we introduced before, we are not able to char- relation between contexts and substrings. acterize the non-terminals by the contexts in which they could appear. This is clearly what 6 Context-Free Languages we don’t want here and we are more interested We now consider the relationship between in the relationship with exact CBFG. CFGs and CBFGs. 6.2 Not all CFLs have an exact CBFG Definition 5. A context-free grammar (CFG) We will show here that the class of context- is a quadruple G = (Σ,V,P,S). Σ is a fi- free grammars is not strictly included in the nite alphabet, V is a set of non terminals + class of exact CBFGs. First, the grammar (Σ ∩ V = ∅), P ⊆ V × (V ∪ Σ) is a finite defined in Section 3.2 is an exact CBFG for set of productions, S ∈ V is the start symbol. the context-free and non regular language In the following, we will suppose that a CFG {anbn|n > 0}, showing the class of exact is represented in Chomsky Normal Form, i.e. CBFG has some elements in the class of CFGs. every production is in the form N → UW with We give now a context-free language L that N, U, W ∈ V or N → a with a ∈ Σ. can not be defined by an exact CBFG: We will write uNv ⇒G uαv if there is a pro- ∗ n m n duction N → α ∈ P . ⇒G is the reflexive tran- L = {a b|n > 0} ∪ {a c |n > m > 0}. sitive closure of ⇒G. The language defined by ∗ ∗ Suppose that there exists an exact CBFG that a CFG G is L(G) = {w ∈ Σ |S ⇒G w}. recognizes it and let N be the length of the 6.1 A Simple Characterization biggest feature (i.e. the longuest left part of A simple approach to try to represent a CFG the feature). For any sufficiently large k > k k+1 by a CBFG is to define a bijection between the N, the sequences c and c share the same k k+1 set of non terminals and the set of context fea- features: FL(c ) = FL(c ). Since the CBFG tures. Informally we define each non terminal is exact we have FL(b) ⊆ FL(ck). Thus any k+1 by a single context and rewrite the productions derivation of a b could be a derivation of k+1 k of the grammar in the CBFG form. a c which does not belong to the language. To build the set of contexts F , it is sufficient However, this restriction does not mean that to choose |V | contexts such that a bijection bC the class of exact CBFG is too restrictive for can be defined between V and F with bC (N) = modelling natural languages. Indeed, the ex- (l, r) implies that S ⇒∗ lNr. Note that we fix ample we have given is highly unnatural and such phenomena appear not to occur in at- bT (S) = (λ, λ). Then, we can define a CBFG tested natural languages. hF,P 0,P 0 , Σi, where P 0 = {b (N) → L T 7 Context-Sensitive Languages bT (U)bT (W )|N → UW ∈ P } and 0 PL = {bT (N) → a|N → a ∈ P, a ∈ Σ}. We now show that there are some exact A similar proof showing that this construction CBFGs that are not context-free. 
In particu- produces an equivalent CBFG can be found lar, we define a language closely related to the in (Clark et al., 2008). MIX language (consisting of strings with an equal number of a’s, b’s and c’s in any order) If this approach allows a simple syntactical which is known to be non context-free, and convertion of a CFG into a CBFG, it is not indeed is conjectured to be outside the class relevant from an NLP point of view. Though of indexed grammars (Boullier, 2003). we associate a non-terminal to a context, this

38 Let M = {(a, b, c)∗}, we consider the language (λ, λ) → {(λ, f 0)}, {(f, λ)}. 0 0 0 0 0 0 L = Labc∪Lab∪Lac∪{a a, b b, c c, dd , ee , ff }: Lab = {wd|w ∈ M, |w|a = |w|b}, This set of rules is incomplete, since for rep- Lac = {we|w ∈ M, |w|a = |w|c}, resenting Lab, the grammar must contain the Labc = {wf|w ∈ M, |w|a = |w|b = |w|c}. rules ensuring to have the same number of a’s In order to define a CBFG recognizing L, we and b’s, and similarly for Lac. To lighten the have to select features (contexts) that can rep- presentation here, the complete grammar is resent exactly the intrinsic components of the presented in Annex. languages composing L. We propose to use the We claim this is an exact CBFG for a following set of features for each sublanguages: context-sensitive language. L is not context- free since if we intersect L with the regular • For L :(λ, d) and (λ, ad), (λ, bd). ab language {Σ∗d}, we get an instance of the

• For Lac:(λ, e) and (λ, ae), (λ, ce). non context-free MIX language (with d ap- pended). The exactness comes from the fact • For Labc:(λ, f). that we chose the contexts in order to ensure that strings belonging to a sublanguage can • For the letters a0, b0, c0, a, b, c we add: not belong to another one and that the deriva- (λ, a), (λ, b), (λ, c), (a0, λ), (b0, λ), (c0, λ). tion of a substring will provide all the possible • For the letters d, e, f, d0, e0, f 0 we add; correct features with the help of the union of (λ, d0), (λ, e0), (λ, f 0), (d, λ), (e, λ), (f, λ). all the possible derivations. Note that the Mix language on its own is Here, Lab will be represented by (λ, d), but we probably not definable by an exact CBFG: it will use (λ, ad), (λ, bd) to define the internal is only when other parts of the language can derivations of elements of Lab. The same idea distributionally define the appropriate partial holds for Lac with (λ, e) and (λ, ae), (λ, ce). structures that we can get context sensitive For the lexical rules and in order to have an languages. Far from being a limitation of this exact CBFG, note the special case for a, b, c: formalism (a bug), we argue this is a feature: 0 {(λ, bd), (λ, ce), (a , λ)} → a it is only in rather exceptional circumstances 0 {(λ, ad), (b , λ)} → b that we will get properly context sensitive lan- 0 {(λ, ad), (λ, ae), (c , λ)} → c guages. This formalism thus potentially ac- For the nine other letters, each one is defined counts not just for the existence of non context 0 with only one context like {(λ, d )} → d. free natural language but also for their rarity.

For the production rules, the most impor- 8 Conclusion tant one is: (λ, λ) → {(λ, d), (λ, e)}, {(λ, f 0)}. Indeed, this rule, with the presence of two The chart in Figure 3 summarises the different contexts in one of categories, means that an relationship shown in this paper. The substi- element of the language has to be derived tutable languages (Clark and Eyraud, 2007) so that it has a prefix u such that fG(u) ⊇ and the very simple ones (Yokomori, 2003) {(λ, d), (λ, e)}. This means u is both an ele- form two different learnable class of languages. ment of Lab and Lac. This rule represents the There is an interesting relationship with Mar- 0 language Labc since {(λ, f )} can only repre- cus External Contextual Grammars (Mitrana, sent the letter f. 2005): if we defined the language of a CBFG ∗ The other parts of the language will be to be the set {fG(u) u : u ∈ Σ } we would defined by the following rules: be taking some steps towards contextual gram- (λ, λ) → {(λ, d)}, {(λ, d0)}, mars. (λ, λ) → {(λ, e)}, {(λ, e0)}, In this paper we have discussed the weak (λ, λ) → {(λ, a)}, {(λ, bd), (λ, ce), (a0, λ)}, generative power of Exact Contextual Binary (λ, λ) → {(λ, b)}, {(λ, ad), (b0, λ)}, Feature Grammars; we conjecture that the (λ, λ) → {(λ, c)}, {(λ, ad), (λ, ae), (c0, λ)}, class of natural language stringsets lie in this (λ, λ) → {(λ, d0)}, {(d, λ)}, class. ECBFGs are efficiently learnable (see (λ, λ) → {(λ, e0)}, {(e, λ)}, (Clark et al., 2008) for details) which is a com-

39 Annex Context sensitive (λ, λ) → {(λ, d), (λ, e)}, {(λ, f 0)} (λ, λ) → {(λ, d)}, {(λ, d0)} very 0 simple (λ, λ) → {(λ, e)}, {(λ, e )} (λ, λ) → {(λ, a)}, {(λ, bd), (λ, ce), (a0, λ)} Regular (λ, λ) → {(λ, b)}, {(λ, ad), (b0, λ)} 0 Context-free (λ, λ) → {(λ, c)}, {(λ, ad), (λ, ae), (c , λ)} substi- 0 tutable (λ, λ) → {(λ, d )}, {(d, λ)} (λ, λ) → {(λ, e0)}, {(e, λ)} Exact CBFG (λ, λ) → {(λ, f 0)}, {(f, λ)} Conjunctive = CBFG Range Concatenation (λ, d) → {(λ, d)}, {(λ, d)} (λ, d) → {(λ, ad)}, {(λ, bd)} (λ, d) → {(λ, bd)}, {(λ, ad)} Figure 3: The relationship between CBFG and (λ, d) → {(λ, d)}, {(λ, ad), (λ, ae), (c0, λ)} other classes of languages. (λ, d) → {(λ, ad), (λ, ae), (c0, λ)}, {(λ, d)}

(λ, ad) → {(λ, ad), (λ, ae), (c0, λ)}, {(λ, ad)} pelling technical advantage of this formalism (λ, ad) → {(λ, ad)}, {(λ, ad), (λ, ae), (c0, λ)} over other more traditional formalisms such as (λ, ad) → {(λ, ad), (b0, λ)}, {(λ, d)} CFGs or TAGs. (λ, ad) → {(λ, d)}, {(λ, ad), (b0, λ)}

(λ, bd) → {(λ, ad), (λ, ae), (c0, λ)}, {(λ, bd)} References (λ, bd) → {(λ, bd)}, {(λ, ad), (λ, ae), (c0, λ)} Pierre Boullier. 2000. A Cubic Time Extension (λ, bd) → {(λ, bd), (λ, ce), (a0, λ)}, {(λ, d)} of Context-Free Grammars. Grammars, 3:111– (λ, bd) → {(λ, d)}, {(λ, bd), (λ, ce), (a0, λ)} 131. (λ, e) → {(λ, e)}, {(λ, e)} Pierre Boullier. 2003. Counting with range con- (λ, e) → {(λ, ae)}, {(λ, ce)} catenation grammars. Theoretical Computer (λ, e) → {(λ, ce)}, {(λ, ae)} Science, 293(2):391–416. (λ, e) → {(λ, e)}, {(λ, ad), (b0, λ)} 0 Alexander Clark and R´emiEyraud. 2007. Polyno- (λ, e) → {(λ, ad), (b , λ)}, {(λ, e)} mial identification in the limit of substitutable (λ, ae) → {(λ, ad), (b0, λ)}, {(λ, ae)} context-free languages. Journal of Machine (λ, ae) → {(λ, ae)}, {(λ, ad), (b0, λ)} Learning Research, 8:1725–1745, Aug. (λ, ae) → {(λ, ad), (λ, ae), (c0, λ)}, {(λ, e)} 0 Alexander Clark, R´emi Eyraud, and Amaury (λ, ae) → {(λ, e)}, {(λ, ad), (λ, ae), (c , λ)} Habrard. 2008. A polynomial algorithm for the (λ, ce) → {(λ, ad), (b0, λ)}, {(λ, ce)} inference of context free languages. In Proceed- (λ, ce) → {(λ, ce)}, {(λ, ad), (b0, λ)} ings of International Colloquium on Grammati- (λ, ce) → {(λ, bd), (λ, ce), (a0, λ)}, {(λ, e)} cal Inference, pages 29–42. Springer, September. (λ, ce) → {(λ, e)}, {(λ, bd), (λ, ce), (a0, λ)} V. Mitrana. 2005. Marcus external contextual {(λ, bd), (λ, ce), (a0, λ)} → a grammars: From one to many dimensions. Fun- {(λ, ad), (b0, λ)} → b damenta Informaticae, 54:307–316. {(λ, ad), (λ, ae), (c0, λ)} → c 0 Alexander Okhotin. 2001. Conjunctive grammars. {(λ, d )} → d J. Autom. Lang. Comb., 6(4):519–535. {(λ, e0)} → e {(λ, f 0)} → f Alexander Okhotin. 2003. An overview of con- 0 junctive grammars. Theory {(λ, a)} → a 0 Column, bulletin of the EATCS, 79:145–163. {(λ, b)} → b {(λ, c)} → c0 Takashi Yokomori. 2003. Polynomial-time iden- {(d, λ)} → d0 tification of very simple grammars from pos- 0 itive data. Theoretical Computer Science, {(e, λ)} → e 0 298(1):179–206. {(f, λ)} → f

Language models for contextual error detection and correction

Herman Stehouwer
Tilburg Centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
[email protected]

Menno van Zaanen
Tilburg Centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
[email protected]

Abstract

The problem of identifying and correcting confusibles, i.e. context-sensitive spelling errors, in text is typically tackled using specifically trained machine learning classifiers. For each different set of confusibles, a specific classifier is trained and tuned.

In this research, we investigate a more generic approach to context-sensitive confusible correction. Instead of using specific classifiers, we use one generic classifier based on a language model. This measures the likelihood of sentences with different possible solutions of a confusible in place. The advantage of this approach is that all confusible sets are handled by a single model. Preliminary results show that the performance of the generic classifier approach is only slightly worse than that of the specific classifier approach.

1 Introduction

When writing texts, people often use spelling checkers to reduce the number of spelling mistakes in their texts. Many spelling checkers concentrate on non-word errors. These errors can be easily identified in texts because they consist of character sequences that are not part of the language. For example, in English woord is not part of the language, hence a non-word error. A possible correction would be word.

Even when a text does not contain any non-word errors, there is no guarantee that the text is error-free. There are several types of spelling errors where the words themselves are part of the language, but are used incorrectly in their context. Note that these kinds of errors are much harder to recognize, as information from the context in which they occur is required to recognize and correct these errors. In contrast, non-word errors can be recognized without context.

One class of such errors, called confusibles, consists of words that belong to the language, but are used incorrectly with respect to their local, sentential context. For example, She owns to cars contains the confusible to. Note that this word is a valid token and part of the language, but used incorrectly in the context. Considering the context, a correct and very likely alternative would be the word two. Confusibles are grouped together in confusible sets. Confusible sets are sets of words that are similar and often used incorrectly in context. Too is the third alternative in this particular confusible set.

The research presented here is part of a larger project, which focusses on context-sensitive spelling mistakes in general. Within this project all classes of context-sensitive spelling errors are tackled. For example, in addition to confusibles, a class of pragmatically incorrect words (where words are incorrectly used within the document-wide context) is considered as well. In this article we concentrate on the problem of confusibles, where the context is only as large as a sentence.

2 Approach

A typical approach to the problem of confusibles is to train a machine learning classifier to a specific confusible set. Most of the work in this area has concentrated on confusibles due to homophony (to, too, two) or similar spelling (desert, dessert). However, some research has also touched upon inflectional or derivational confusibles such as I versus me (Golding and Roth, 1999). For instance, when word forms are homophonic, they tend to get confused often in writing (cf. the situation with to, too, and two, affect and effect, or there, their, and they're in English) (Sandra et al., 2001; Van den Bosch and Daelemans, 2007).
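To make the generic language-model classifier idea from the abstract concrete, here is a minimal sketch of how a trigram model could choose among the members of a confusible set. The counting scheme and function names are ours and are only meant as an illustration, not as the system evaluated below.

```python
# Illustrative trigram-based selection of a confusible: score each candidate
# by the product of the (relative frequency) probabilities of the trigrams
# covering its position; no smoothing or back-off in this sketch.

from collections import Counter

def train_trigrams(corpus_sentences):
    counts = Counter()
    for sent in corpus_sentences:
        tokens = sent.split()
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}

def best_candidate(left, right, candidates, trigram_prob):
    def score(word):
        tokens = left[-2:] + [word] + right[:2]
        p = 1.0
        for tri in zip(tokens, tokens[1:], tokens[2:]):
            p *= trigram_prob.get(tri, 0.0)
        return p
    return max(candidates, key=score)

model = train_trigrams(["much stronger than most analysts had expected",
                        "more than expected"])
print(best_candidate(["much", "stronger"], ["most", "analysts"],
                     ["then", "than"], model))   # -> 'than'
```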


41 Most work on confusible disambiguation using est probability according to the language model is machine learning concentrates on hand-selected then selected. Since the language model assigns sets of notorious confusibles. The confusible sets probabilities to all sequences of words, it is pos- are typically very small (two or three elements) sible to define new confusible sets on the fly and and the machine learner will only see training let the language model disambiguate them with- examples of the members of the confusible set. out any further training. Obviously, this is not This approach is similar to approaches used in ac- possible for a specialized machine learning clas- cent restoration (Yarowsky, 1994; Golding, 1995; sifier approach, where a classifier is fine-tuned to Mangu and Brill, 1997; Wu et al., 1999; Even- the features and classes of a specific confusible set. Zohar and Roth, 2000; Banko and Brill, 2001; The expected disadvantage of the generic (lan- Huang and Powers, 2001; Van den Bosch, 2006). guage model) classifier approach is that the accu- The task of the machine learner is to decide, us- racy is expected to be less than that of the specific ing features describing information from the con- (specialized machine learning classifier) approach. text, which word taken from the confusible set re- Since the specific classifiers are tuned to each spe- ally belongs in the position of the confusible. Us- cific confusible set, the weights for each of the ing the example above, the classifier has to decide features may be different for each set. For in- which word belongs on the position of the X in stance, there may be confusibles for which the cor- She owns X cars, where the possible answers for rect word is easily identified by words in a specific X are to, too, or two. We call X, the confusible position. If a determiner, like the, occurs in the po- that is under consideration, the focus word. sition directly before the confusible, to or too are Another way of looking at the problem of con- very probably not the correct answers. The spe- fusible disambiguation is to see it as a very spe- cific approach can take this into account by assign- cialized case of word prediction. The problem is ing specific weights to part-of-speech and position then to predict which word belongs at a specific combinations, whereas the generic approach can- position. Using similarities between these cases, not do this explicitly for specific cases; the weights we can use techniques from the field of language follow automatically from the training corpus. modeling to solve the problem of selecting the best In this article, we will investigate whether it is alternative from confusible sets. We will investi- possible to build a confusible disambiguation sys- gate this approach in this article. tem that is generic for all sets of confusibles using language models as generic classifiers and investi- Language models assign probabilities to se- gate in how far this approach is useful for solving quences of words. Using this information, it the confusible problem. We will compare these is possible to predict the most likely word in generic classifiers against specific classifiers that a certain context. If a language model gives are trained for each confusible set independently. us the probability for a sequence of n words PLM (w1,...,wn), we can use this to predict the 3 Results most likely word w following a sequence of n − 1 words arg maxw PLM (w1,...,wn−1, w). 
Obvi- To measure the effectiveness of the generic clas- ously, a similar approach can be taken with w in sifier approach to confusible disambiguation, and the middle of the sequence. to compare it against a specific classifier approach Here, we will use a language model as a classi- we have implemented several classification sys- fier to predict the correct word in a context. Since tems. First of these is a majority class baseline sys- a language model models the entire language, it is tem, which selects the word from the confusible different from a regular machine learning classifier set that occurs most often in the training data.1 trained on a specific set of confusibles. The advan- We have also implemented several generic classi- tage of this approach to confusible disambiguation fiers based on different language models. We com- is that the language model can handle all potential pare these against two machine learning classi- confusibles without any further training and tun- fiers. The machine learning classifiers are trained ing. With the language model it is possible to take separately for each different experiment, whereas the words from any confusible set and compute the 1This baseline system corresponds to the simplest lan- probabilities of those words in the context. The guage model classifier. In this case, it only uses n-grams with element from the confusible set that has the high- n = 1.

42 the parameters and the training material of the lan- ability. Since the probabilities of the n-grams are guage model are kept fixed throughout all the ex- multiplied, having a n-gram probability of zero re- periments. sults in a zero probability for the entire sequence. There may be two reasons for an n-gram to have 3.1 System description probability zero: there is not enough training data, so this sequence has not been seen yet, or this se- There are many different approaches that can be quence is not valid in the language. taken to develop language models. A well-known approach is to use n-grams, or Markov models. When it is known that a sequence is not valid These models take into account the probability in the language, this information can be used to that a word occurs in the context of the previous decide which word from the confusible set should n − 1 words. The probabilities can be extracted be selected. However, when the sequence simply from the occurrences of words in a corpus. Proba- has not been seen in the training data yet, we can- bilities are computed by taking the relative occur- not rely on this information. To resolve the se- rence count of the n words in sequence. quences with zero probability, we can use smooth- ing. However, this assumes that the sequence is In the experiments described below, we will use valid, but has not been seen during training. The a tri-gram-based language model and where re- other solution, back-off, tries not to make this as- quired this model will be extended with bi-gram sumption. It checks whether subsequences of the and uni-gram language models. The probability sequence are valid, i.e. have non-zero probabili- of a sequence is computed as the combination of ties. Because of this, we will not use smoothing to the probabilities of the tri-grams that are found in reach non-zero probabilities in the current exper- the sequence. iments, although this may be investigated further Especially when n-grams with large n are used, in the future. data sparseness becomes an issue. The training The first language model that we will investi- data may not contain any occurrences of the par- gate here is a linear combination of the differ- ticular sequence of n symbols, even though the ent n-grams. The probability of a sequence is sequence is correct. In that case, the probability computed by a linear combination of weighted n- extracted from the training data will be zero, even gram probabilities. We will report on two different though the correct probability should be non-zero weight settings, one system using uniform weight- (albeit small). To reduce this problem we can ei- ing, called uniform linear, and one where uni- ther use back-off or smoothing when the probabil- grams receive weight 1, bi-grams weight 138, and ity of an n-gram is zero. In the case of back-off, tri-grams weight 437.2 These weights are normal- the probabilities of lower order n-grams are taken ized to yield a final probability for the sequence, into account when needed. Alternatively, smooth- resulting in the second system called weighted lin- ing techniques (Chen and Goodman, 1996) redis- ear. tribute the probabilities, taking into account previ- ously unseen word sequences. The third system uses the probabilities of the different n-grams separately, instead of using the Even though the language models provide us probabilities of all n-grams at the same time as is with probabilities of entire sequences, we are done in the linear systems. 
The continuous back- only interested in the n-grams directly around the off method uses only one of the probabilities at confusible when using the language models in each position, preferring the higher-level probabil- the context of confusible disambiguation. The ities. This model provides a step-wise back-off. probabilities of the rest of the sequence will re- The probability of a sequence is that of the tri- main the same whichever alternative confusible grams contained in that sequence. However, if the is inserted in the focus word position. Fig- probability of a trigram is zero, a back-off to the ure 1 illustrates that the probability of for example probabilities of the two bi-grams of the sequence P (analysts had expected ) is irrelevant for the de- is used. If that is still zero, the uni-gram probabil- cision between then and than because it occurs in ity at that position is used. Note that this uni-gram both sequences. probability is exactly what the baseline system The different language models we will consider here are essentially the same. The differences lie 2These weights are selected by computing the accuracy of in how they handle sequences that have zero prob- all combinations of weights on a held out set.

43 . . . much stronger most analysts had expected .

than then P (much stronger than) P (much stronger then) ×P (stronger than most) ×P (stronger then most) ×P (than most analysts) ×P (then most analysts)

Figure 1: Computation of probabilities using the language model.

uses. With this approach it may be the case that the probability for one word in the confusible set is computed based on tri-grams, whereas the probability of another word in the set of confusibles is based on bi-grams or even the uni-gram probability. Effectively, this means that different kinds of probabilities are compared. The same weights as in the weighted linear systems are used.

To resolve the problem of unbalanced probabilities, a fourth language model, called synchronous back-off, is proposed. Whereas in the case of the continuous back-off model, two words from the confusible set may be computed using probabilities of different-level n-grams, the synchronous back-off model uses probabilities of the same level of n-grams for all words in the confusible set, with n being the highest value for which at least one of the words has a non-zero probability. For instance, when word a has a tri-gram probability of zero and word b has a non-zero tri-gram probability, b is selected. When both have a zero tri-gram probability, a back-off to bi-grams is performed for both words. This is in line with the idea that if a probability is zero, the training data is sufficient, hence the sequence is not in the language.

To implement the specific classifiers, we used the TiMBL implementation of a k-NN classifier (Daelemans et al., 2007). This implementation of the k-NN algorithm is called IB1. We have tuned the different parameter settings for the k-NN classifier using Paramsearch (Van den Bosch, 2004), which resulted in a k of 35.³ To describe the instances, we try to model the data as similarly as possible to the data used by the generic classifier approach. Since the language model approaches use n-grams with n = 3 as the largest n, the features for the specific classifier approach use words one and two positions left and right of the focus word. The focus word becomes the class that needs to be predicted. We show an example of both training and testing in figure 2. Note that the features for the machine learning classifiers could be expanded with, for instance, part-of-speech tags, but in the current experiments only the word forms are used as features.

In addition to the k-NN classifier, we also run the experiments using the IGTree classifier (denoted IGTree in the rest of the article), which is also contained in the TiMBL distribution. IGTree is a fast, trie-based approximation of k-nearest neighbor classification (Knuth, 1973; Daelemans et al., 1997). IGTree allows for fast training and testing even with millions of examples. IGTree compresses a set of labeled examples into a decision tree structure similar to the classic C4.5 algorithm (Quinlan, 1993), except that throughout one level in the IGTree decision tree, the same feature is tested. Classification in IGTree is a simple procedure in which the decision tree is traversed from the root node down, and one path is followed that matches the actual values of the new example to be classified. If a leaf is found, the outcome stored at the leaf of the IGTree is returned as the classification. If the last node is not a leaf node, but there are no outgoing arcs that match a feature-value combination of the instance, the most likely outcome stored at that node is produced as the resulting classification. This outcome is computed by collating the outcomes of all leaf nodes that can be reached from the node.

IGTree is typically able to compress a large example set into a lean decision tree with high compression factors. This is done in reasonably short time, comparable to other compression algorithms. More importantly, IGTree's classification time depends only on the number of features (O(f)). Indeed, in our experiments we observe high compression rates. One of the unique characteristics of IGTree compared to basic k-NN is its resemblance to smoothing of a basic language model (Zavrel and Daelemans, 1997), while still being a generic classifier that supports any number and type of features. For these reasons, IGTree is also included in the experiments.

³We note that k is handled slightly differently in TiMBL than usual: k denotes the number of closest distances considered, so if there are multiple instances that have the same (closest) distance, they are all considered.
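To make the selection procedure concrete, the following sketch shows one way the synchronous back-off selection described above could be implemented. It is an illustrative reconstruction, not the authors' code: ngram_prob (an n-gram lookup that returns 0.0 for unseen sequences) and the other names are hypothetical placeholders.

# Illustrative sketch, not the authors' implementation: synchronous back-off
# selection among a confusible set. `ngram_prob` is a hypothetical lookup that
# returns the corpus probability of an n-gram (0.0 if it was never seen).

def context_ngrams(left, right, candidate, n):
    """All order-n n-grams that contain the candidate at the focus position,
    e.g. for n=3: (l2, l1, c), (l1, c, r1), (c, r1, r2)."""
    window = left + [candidate] + right
    i = len(left)                       # position of the candidate
    return [tuple(window[s:s + n])
            for s in range(max(0, i - n + 1), i + 1)
            if s + n <= len(window)]

def score(candidate, left, right, n, ngram_prob):
    """Product of the probabilities of all order-n n-grams covering the candidate."""
    p = 1.0
    for gram in context_ngrams(left, right, candidate, n):
        p *= ngram_prob(gram)
    return p

def synchronous_backoff(confusibles, left, right, ngram_prob, max_n=3):
    """Compare all candidates at the same n-gram order: the largest n for which
    at least one candidate has a non-zero score; back off jointly otherwise."""
    for n in range(max_n, 0, -1):
        scores = {c: score(c, left, right, n, ngram_prob) for c in confusibles}
        if any(p > 0.0 for p in scores.values()):
            return max(scores, key=scores.get)
    return confusibles[0]               # no evidence at any order: arbitrary pick

Called as synchronous_backoff(["than", "then"], ["much", "stronger"], ["most", "analysts"], ngram_prob), the sketch compares the two tri-gram products of figure 1 and only consults bi-grams when both tri-gram products are zero, which is exactly the joint back-off behaviour described above.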

Training . . . much stronger than most analysts had expected .

⟨much, stronger, most, analysts⟩ ⇒ than

Testing . . . much stronger most analysts had expected .

⟨much, stronger, most, analysts⟩ ⇒ ?

Figure 2: During training, a classified instance (in this case for the confusible pair {then, than}) is generated from a sentence. During testing, a similar instance is generated. The classifier decides what the corresponding class is and, hence, which word should be the focus word.

3.2 Experimental settings

The probabilities used in the language models of the generic classifiers are computed by looking at occurrences of n-grams. These occurrences are extracted from a corpus. The training instances used in the specific machine learning classifiers are also extracted from the same data set. For training purposes, we used the Reuters news corpus RCV1 (Lewis et al., 2004). The Reuters corpus contains about 810,000 categorized newswire stories as published by Reuters in 1996 and 1997. This corpus contains around 130 million tokens.

For testing purposes, we used the Wall Street Journal part of the Penn Treebank corpus (Marcus et al., 1993). This well-known corpus contains articles from the Wall Street Journal from 1987 to 1989. We extract our test instances from this corpus in the same way as we extract our training data from the Reuters corpus. There are minor tokenization differences between the corpora; the data is corrected for these differences.

Both corpora are in the domain of English-language news texts, so we expect them to have similar properties. However, they are different corpora and hence are slightly different. This means that there are also differences between the training and testing set. We have selected this division to create a more realistic setting. This should allow for a comparison closer to real-world use than when both training and testing instances are extracted from the same corpus.

For the specific experiments, we selected a number of well-known confusible sets to test the different approaches. In particular, we look at {then, than}, {its, it’s}, {your, you’re}, and {their, there, they’re}. To compare the difficulty of these problems, we also selected two words at random and used them as a confusible set. The random category consists of two words that were randomly selected from all words in the Reuters corpus that occurred more than a thousand times. The words that were chosen, and used for all experiments here, are refugees and effect. They occur around 27 thousand times in the Reuters corpus.

3.3 Empirical results

Table 1 sums up the results we obtained with the different systems. The baseline scores are generally very high, which tells us that the distribution of classes in a single confusible set is severely skewed, up to a ten to one ratio. This also makes the task hard. There are many examples for one word in the set, but only very few training instances for the other(s). However, it is especially important to recognize the important aspects of the minority class.

The results clearly show that the specific classifier approaches outperform the other systems. For instance, on the first task ({then, than}) the classifier achieves an accuracy slightly over 98%, whereas the language model systems only yield around 96%. This is as expected: the classifier is trained on just one confusible task and is therefore able to specialize on that task.

Comparing the two specific classifiers, we see that the accuracy achieved by IB1 and IGTree is quite similar. In general, IGTree performs a bit worse than IB1 on all confusible sets, which is as expected. However, in general it is possible for IGTree to outperform IB1 on certain tasks. In our experience this mainly happens on tasks where the usage of IGTree, allowing for more compact internal representations, allows one to use much more training data.

                        {then, than}   {its, it’s}   {your, you’re}   {their, there, they’re}   random
Baseline                       82.63         92.42            78.55                     68.36    93.16
IB1                            98.01         98.67            96.36                     97.12    97.89
IGTree                         97.07         96.75            96.00                     93.02    95.79
Uniform linear                 68.27         50.70            31.64                     32.72    38.95
Weighted linear                94.43         92.88            93.09                     93.25    88.42
Continuous back-off            81.49         83.22            74.18                     86.01    63.68
Synchronous back-off           96.42         94.10            92.36                     93.06    87.37
Number of cases                2,458         4,830              275                     3,053      190

Table 1: Performance achieved by the different systems, shown in accuracy (%). The Number of cases row gives the number of instances in the test set.

IGTree also leads to improved performance in cases where the features have a strong, absolute ordering of importance with respect to the classification problem at hand.

The generic language model approaches perform reasonably well. However, there are clear differences between the approaches. For instance, the weighted linear and synchronous back-off approaches work well, but uniform linear and continuous back-off perform much worse. Especially the synchronous back-off approach achieves decent results, regardless of the confusible problem.

It is not very surprising to see that the continuous back-off method performs worse than the synchronous back-off method. Remember that the continuous back-off method always uses lower-level n-grams when zero probabilities are found. This is done independently of the probabilities of the other words in the confusible set. The continuous back-off method prefers n-grams with larger n; however, it does not penalize backing off to an n-gram with smaller n. Combine this with the fact that n-gram probabilities with large n are comparatively lower than those for n-grams with smaller n, and it becomes likely that a bi-gram contributes more to the erroneous option than the correct tri-gram does to the correct option. Tri-grams are more sparse than bi-grams, given the same data.

The weighted linear approach outperforms the uniform linear approach by a large margin on all confusible sets. It is likely that the contribution from the n-grams with large n overrules the probabilities of the n-grams with smaller n in the uniform linear method. This causes a bias towards the more frequent words, compounded by the fact that bi-grams, and uni-grams even more so, are less sparse and therefore contribute more to the total probability.

We see that both the generic and specific classifier approaches perform consistently across the different confusible sets. The synchronous back-off approach is the best performing generic classifier approach we tested. It consistently outperforms the baseline, and overall performs better than the weighted linear approach.

The experiments show that generic classifiers based on language models can be used in the context of confusible disambiguation. However, the n in the different n-grams is of major importance. Exactly which n-grams should be used to compute the probability of a sequence requires more research. The experiments also show that approaches that concentrate on n-grams with larger n yield more encouraging results.

4 Conclusion and future work

Confusibles are spelling errors that can only be detected within their sentential context. This kind of error requires a completely different approach compared to non-word errors (errors that can be identified out of context, i.e. sequences of characters that do not belong to the language). In practice, most confusible disambiguation systems are based on machine learning classification techniques, where for each type of confusible, a new classifier is trained and tuned.

In this article, we investigate the use of language models in the context of confusible disambiguation. This approach works by selecting the word in the set of confusibles that has the highest probability in the sentential context according to the language model. Any kind of language model can be used in this approach.

The main advantage of using language models as generic classifiers is that it is easy to add new sets of confusibles without retraining or adding additional classifiers. The entire language is mod-

46 eled, which means that all the information on References words in their context is inherently present. Banko, M. and Brill, E. (2001). Scaling to very very The experiments show that using generic clas- large corpora for natural language disambiguation. In Proceedingsof the 39th AnnualMeeting of the As- sifiers based on simple n-gram language models sociation for Computational Linguistics, pages 26– yield slightly worse results compared to the spe- 33. Association for Computational Linguistics. cific classifier approach, where each classifier is Chen, S. and Goodman, J. (1996). An empirical study specifically trained on one confusible set. How- of smoothing techniques for language modelling. In ever, the advantage of the generic classifier ap- Proceedings of the 34th Annual Meeting of the ACL, proach is that only one system has to be trained, pages 310–318. ACL. compared to different systems for each confusible Daelemans, W., Van den Bosch, A., and Weijters, A. in the specific classifier case. Also, the exact com- (1997). IGTree: using trees for compression and putation of the probabilities using the n-grams, in classification in lazy learning algorithms. Artificial particular the means of backing-off, has a large Intelligence Review, 11:407–423. impact on the results. Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2007). TiMBL: Tilburg Memory As future work, we would like to investigate the Based Learner, version 6.1, reference guide. Techni- accuracy of more complex language models used cal Report ILK 07-07, ILK Research Group, Tilburg as classifiers. The n-gram language models de- University. scribed here are relatively simple, but more com- de la Higuera, C. (2005). A bibliographical study plex language models could improve performance. of grammatical inference. Pattern Recognition, In particular, instead of back-off, smoothing tech- 38(9):1332 – 1348. Grammatical Inference. niques could be investigated to reduce the impact Even-Zohar, Y. and Roth, D. (2000). A classification of zero probability problems (Chen and Goodman, approach to word prediction. In Proceedings of the 1996). This assumes that the training data we are First North-American Conference on Computational currently working with is not enough to properly Linguistics, pages 124–131, New Brunswick, NJ. describe the language. ACL. Additionally, language models that concentrate Golding, A. and Roth, D. (1999). A Winnow-Based Approach to Context-Sensitive Spelling Correction. on more structural descriptions of the language, Machine Learning, 34(1–3):107–130. for instance, using grammatical inference tech- niques (de la Higuera, 2005), or models that ex- Golding, A. R. (1995). A Bayesian hybrid method for context-sensitive spelling correction. In Proceed- plicitly take long distance dependencies into ac- ings of the 3rd workshop on very large corpora, count (Griffiths et al., 2005) can be investigated. ACL-95. This leads to much richer language models that could, for example, check whether there is already Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenen- baum, J. B. (2005). Integratingtopics and syntax. In a verb in the sentence (which helps in cases such In Advances in Neural Information Processing Sys- as {its, it’s}). tems 17, pages 537–544. MIT Press. A different route which we would also like to in- Huang,J. H. and Powers, D. W. (2001). Largescale ex- vestigate is the usage of a specific classifier, such periments on correction of confused words. 
In Aus- as TiMBL’s IGTree, as a language model. If a tralasian Computer Science Conference Proceed- ings, pages 77–82, Queensland AU. Bond Univer- classifier is trained to predict the next word in the sity. sentence or to predict the word at a given position with both left and right context as features, it can Knuth, D. E. (1973). The art of computer program- ming, volume 3: Sorting and searching. Addison- be used to estimate the probability of the words in Wesley, Reading, MA. a confusible set, just like the language models we have looked at so far. Another type of classifier Lewis, D. D., Yang, Y., Rose, T. G., Dietterich, G., Li, F., and Li, F. (2004). Rcv1: A new benchmark col- might estimate the perplexity at a position, or pro- lection for text categorization research. Journal of vide some other measure of “surprisedness”. Ef- Machine Learning Research, 5:361–397. fectively, these approaches all take a model of the entire language (as described in the training data) Mangu, L. and Brill, E. (1997). Automatic rule ac- quisition for spelling correction. In Proceedings of into account. the International Conference on Machine Learning, pages 187–194.

47 Marcus, M., Santorini, S., and Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of En- glish: the Penn Treebank. Computational Linguis- tics, 19(2):313–330.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA. Sandra, D., Daems, F., and Frisson, S. (2001). Zo helder en toch zoveel fouten! wat leren we uit psy- cholingu¨ıstisch onderzoek naar werkwoordfouten bij ervaren spellers? Tijdschrift van de Vereniging voor het Onderwijs in het Nederlands, 30(3):3–20. Van den Bosch, A. (2004). Wrapped progressive sampling search for optimizing learning algorithm parameters. In Verbrugge, R., Taatgen, N., and Schomaker, L., editors, Proceedings of the Sixteenth Belgian-Dutch Conference on Artificial Intelligence, pages 219–226, Groningen, The Netherlands. Van den Bosch, A. (2006). Scalable classification- based word prediction and confusible correction. Traitement Automatique des Langues, 46(2):39–63. Van den Bosch, A. and Daelemans, W. (2007). Tussen Taal, Spelling en Onderwijs, chapter Dat gebeurd mei niet: Computationele modellen voor verwarbare homofonen, pages 199–210. Academia Press. Wu, D., Sui, Z., and Zhao, J. (1999). An information- based method for selecting feature types for word prediction. In Proceedings of the Sixth European Conference on Speech Communication and Technol- ogy, EUROSPEECH’99, Budapest. Yarowsky, D. (1994). Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proceedings of the Annual Meeting of the ACL, pages 88–95. Zavrel, J. and Daelemans, W. (1997). Memory-based learning: Using similarity for smoothing. In Pro- ceedings of the 35th Annual Meeting of the Associa- tion for Computational Linguistics, pages 436–443.

On statistical parsing of French with supervised and semi-supervised strategies

Marie Candito*, Benoît Crabbé* and Djamé Seddah⋄

* Université Paris 7 ⋄ Université Paris 4 UFRL et INRIA (Alpage) LALIC et INRIA (Alpage) 30 rue du Château des Rentiers 28 rue Serpente F-75013 Paris — France F-75006 Paris — France

Abstract regularities of the language, in order to apply these patterns to unseen data. This paper reports results on grammati- Since capturing underlying linguistic rules is cal induction for French. We investigate also an objective for linguists, it makes sense how to best train a parser on the French to use supervised learning from linguistically- Treebank (Abeillé et al., 2003), viewing defined generalizations. One generalization is the task as a trade-off between generaliz- typically the use of phrases, and phrase-structure ability and interpretability. We compare, rules that govern the way words are grouped to- for French, a supervised lexicalized pars- gether. It has to be stressed that these syntactic ing algorithm with a semi-supervised un- rules exist at least in part independently of seman- lexicalized algorithm (Petrov et al., 2006) tic interpretation. along the lines of (Crabbé and Candito, Interpretability But the main reason to use su- 2008). We report the best results known pervised learning for parsing, is that we want to us on French statistical parsing, that we structures that are as interpretable as possible, in obtained with the semi-supervised learn- order to extract some knowledge from the anal- ing algorithm. The reported experiments ysis (such as deriving a semantic analysis from can give insights for the task of grammat- a parse). Typically, we need a syntactic analysis ical learning for a morphologically-rich to reflect how words relate to each other. This language, with a relatively limited amount is our main motivation to use supervised learn- of training data, annotated with a rather ing : the learnt parser will output structures as flat structure. defined by linguists-annotators, and thus inter- pretable within the linguistic theory underlying the 1 Natural language parsing annotation scheme of the treebank. It is important Despite the availability of annotated data, there to stress that this is more than capturing syntactic have been relatively few works on French statis- regularities : it has to do with the meaning of the tical parsing. Together with a treebank, the avail- words. ability of several supervised or semi-supervised It is not certain though that both requirements grammatical learning algorithms, primarily set up (generalizability / interpretability) are best met in on English data, allows us to figure out how they the same structures. In the case of supervised behave on French. learning, this leads to investigate different instan- Before that, it is important to describe the char- tiations of the training trees, to help the learning, acteristics of the parsing task. In the case of sta- while keeping the maximum interpretability of the tistical parsing, two different aspects of syntactic trees. As we will see with some of our experi- structures are to be considered : their capacity to ments, it may be necessary to find a trade-off be- capture regularities and their interpretability for tween generalizability and interpretability. further processing. Further, it is not guaranteed that syntactic rules Generalizability Learning for statistical parsing infered from a manually annotated treebank pro- requires structures that capture best the underlying duce the best language model. This leads to

Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 49–57, Athens, Greece, 30 March 2009. c 2009 Association for Computational Linguistics

49 methods that use semi-supervised techniques on FTB, versus 24 tokens in the PTB. a treebank-infered grammar backbone, such as Inflection : French morphology is richer than En- (Matsuzaki et al., 2005; Petrov et al., 2006). glish and leads to increased data sparseness for The plan of the paper is as follows : in the statistical parsing. There are 24098 types in the next section, we describe the available treebank FTB, entailing an average of 16 tokens occurring for French, and how its structures can be inter- for each type (versus 12 for the PTB). preted. In section 3, we describe the typical prob- Flat structure : The annotation scheme is flatter lems encountered when parsing using a plain prob- in the FTB than in the PTB. For instance, there abilistic context-free grammar, and existing algo- are no VPs for finite verbs, and only one sentential rithmic solutions that try to circumvent these prob- level for sentences whether introduced by comple- lems. Next we describe experiments and results mentizer or not. We can measure the corpus flat- when training parsers on the French data. Finally, ness using the ratio between tokens and non ter- we discuss related work and conclude. minal symbols, excluding preterminals. We obtain 0.69 NT symbol per token for FTB and 1.01 for the 2 Interpreting the French trees PTB. Compounds : Compounds are explicitly annotated The French Treebank (Abeillé et al., 2003) is a (see the compound peut-être in Figure 1 ) and very publicly available sample from the newspaper Le frequent : 14,52% of tokens are part of a com- Monde, syntactically annotated and manually cor- pound. They include digital numbers (written with rected for French. spaces in French 10 000), very frozen compounds pomme de terre (potato) but also named entities le or sequences whose meaning is compositional but bilan where insertion is rare or difficult (garde d’enfant n’ (child care)). est Now let us focus on what is expressed in the French annotation scheme, and why syntactic in- peut - formation is split between constituency and func- être tional annotation. pas Syntactic categories and constituents capture dis- aussi tributional generalizations. A syntactic category sombre groups forms that share distributional properties. . Nonterminal symbols that label the constituents are a further generalizations over sequences of cat- Figure 1: Simplified example of the FTB egories or constituents. For instance about any- where it is grammatical to have a given NP, it is To encode syntactic information, it uses a com- implicitly assumed that it will also be grammati- bination of labeled constituents, morphological cal - though maybe nonsensical - to have instead annotations and functional annotation for verbal any other NPs. Of course this is known to be false dependents as illustrated in Figure 1. This con- in many cases : for instance NPs with or with- stituent and functional annotation was performed out determiners have very different distributions in in two successive steps : though the original re- French (that may justify a different label) but they lease (Abeillé et al., 2000) consists of 20,648 sen- also share a lot. Moreover, if words are taken into tences (hereafter FTB-V0), the functional annota- account, and not just sequences of categories, then tion was performed later on a subset of 12351 sen- constituent labels are a very coarse generalization. tences (hereafter FTB). 
This subset has also been Constituents also encode dependencies : for in- revised, and is known to be more consistently an- stance the different PP-attachment for the sen- notated. This is the release we use in our experi- tences I ate a cake with cream / with a fork re- ments. Its key properties, compared with the Penn flects that with cream depends on cake, whereas Treebank, (hereafter PTB) are the following : with a fork depends on ate. More precisely, a Size : The FTB is made of 385 458 tokens and syntagmatic tree can be interpreted as a depen- 12351 sentences, that is the third of the PTB. The dency structure using the following conventions : average length of a sentence is 31 tokens in the

50 for each constituent, given the dominating symbol purpose of natural language parsing : the inde- and the internal sequence of symbols, (i) a head pendence assumptions made by the model are too symbol can be isolated and (ii) the siblings of that strong. In other words all decisions are local to a head can be interpreted as containing dependents grammar rule. of that head. Given these constraints, the syntag- However as clearly pointed out by (Johnson, matic structure may exhibit various degree of flat- 1998) decisions have to take into account non lo- ness for internal structures. cal grammatical properties: for instance a noun Functional annotation Dependencies are en- phrase realized in subject position is more likely to coded in constituents. While X-bar inspired con- be realized by a pronoun than a noun phrase real- stituents are supposed to contain all the syntac- ized in object position. Solving this first method- tic information, in the FTB the shape of the con- ological issue, has led to solutions dubbed here- stituents does not necessarily express unambigu- after as unlexicalized statistical parsing (Johnson, ously the type of dependency existing between a 1998; Klein and Manning, 2003a; Matsuzaki et head and a dependent appearing in the same con- al., 2005; Petrov et al., 2006). stituent. Yet this is crucial for example to ex- A second class of non local decisions to be tract the underlying predicate-argument structures. taken into account while parsing natural languages This has led to a “flat” annotation scheme, com- are related to handling lexical constraints. As pleted with functional annotations that inform on shown above the subcategorization properties of the type of dependency existing between a verb a predicative word may have an impact on the de- and its dependents. This was chosen for French cisions concerning the tree structures to be asso- to reflect, for instance, the possibility to mix post- ciated to a given sentence. Solving this second verbal modifiers and complements (Figure 2), or methodological issue has led to solutions dubbed to mix post-verbal subject and post-verbal indi- hereafter as lexicalized parsing (Charniak, 2000; rect complements : a post verbal NP in the FTB Collins, 1999). can correspond to a temporal modifier, (most of- In a supervised setting, a third and practical ten) a direct object, or an inverted subject, and in problem turns out to be critical: that of data the three cases other subcategorized complements sparseness since available treebanks are generally may appear. too small to get reasonable probability estimates. SENT Three class of solutions are possible to reduce data NP-SUJ VN NP-MOD PP-AOBJ sparseness: (1) enlarging the data manually or au- D N V V V D N A P NP tomatically (e.g. (McClosky et al., 2006) uses self- une lettre avait été envoyée la semaine dernière aux N training to perform this step) (2) smoothing, usu- salariés ally this is performed using a markovization pro- SENT cedure (Collins, 1999; Klein and Manning, 2003a) NP-SUJ VN NP-OBJ PP-AOBJ and (3) make the data more coarse (i.e. clustering). D N V V D N P NP Le Conseil a notifié sa décision à D N 3.1 Lexicalized algorithm la banque The first algorithm we use is the lexicalized parser Figure 2: Two examples of post-verbal NPs : a of (Collins, 1999). It is called lexicalized, as it direct object and a temporal modifier annotates non terminal nodes with an additional latent symbol: the head word of the subtree. 
This additional information attached to the categories 3 Algorithms for probabilistic grammar aims at capturing bilexical dependencies in order learning to perform informed attachment choices. We propose here to investigate how to apply statis- The addition of these numerous latent sym- tical parsing techniques mainly tested on English, bols to non terminals naturally entails an over- to another language – French –. In this section we specialization of the resulting models. To en- briefly introduce the algorithms investigated. sure generalization, it therefore requires to add Though Probabilistic Context Free Grammars additional simplifying assumptions formulated as (PCFG) is a baseline formalism for probabilistic a variant of usual naïve Bayesian-style simplify- parsing, it suffers a fundamental problem for the ing assumptions: the probability of emitting a non

51 head node is assumed to depend on the head and the fact that an NP in subject position is more the mother node only, and not on other sibling likely realized as a pronoun. nodes1. The first unlexicalized algorithms set up in this Since Collins demonstrated his models to sig- trend (Johnson, 1998; Klein and Manning, 2003a) nificantly improve parsing accuracy over bare also use language dependent and manually de- PCFG, lexicalization has been thought as a ma- fined heuristics to add the latent annotations. The jor feature for probabilistic parsing. However two specialization induced by this additional annota- problems are worth stressing here: (1) the reason tion is counterbalanced by simplifying assump- why these models improve over bare PCFGs is not tions, dubbed markovization (Klein and Manning, guaranteed to be tied to the fact that they capture 2003a). bilexical dependencies and (2) there is no guar- Using hand-defined heuristics remains prob- antee that capturing non local lexical constraints lematic since we have no guarantee that the latent yields an optimal language model. annotations added in this way will allow to extract Concerning (1) (Gildea, 2001) showed that full an optimal language model. lexicalization has indeed small impact on results : A further development has been first introduced he reimplemented an emulation of Collins’ Model by (Matsuzaki et al., 2005) who recasts the prob- 1 and found that removing all references to bilex- lem of adding latent annotations as an unsuper- 2 ical dependencies in the statistical model , re- vised learning problem: given an observed PCFG sulted in a very small parsing performance de- induced from the treebank, the latent grammar is crease (PARSEVAL recall on WSJ decreased from generated by combining every non terminal of the 86.1 to 85.6). Further studies conducted by (Bikel, observed grammar to a predefined set H of latent 2004a) proved indeed that bilexical information symbols. The parameters of the latent grammar were used by the most probable parses. The idea are estimated from the observed trees using a spe- is that most bilexical parameters are very similar cific instantiation of EM. to their back-off distribution and have therefore a This first procedure however entails a combi- minor impact. In the case of French, this fact can natorial explosion in the size of the latent gram- only be more true, with one third of training data mar as |H| increases. (Petrov et al., 2006) (here- compared to English, and with a much richer in- after BKY) overcomes this problem by using the flection that worsens lexical data sparseness. following algorithm: given a PCFG G0 induced Concerning (2) the addition of head word an- from the treebank, iteratively create n grammars notations is tied to the use of manually defined G1 ...Gn (with n = 5 in practice), where each heuristics highly dependent on the annotation iterative step is as follows : scheme of the PTB. For instance, Collins’ mod- els integrate a treatment of coordination that is not • SPLIT Create a new grammar Gi from Gi−1 adequate for the FTB-like coordination annotation. by splitting every non terminal of Gi in two new symbols. Estimate Gi’s parameters 3.2 Unlexicalized algorithms on the observed treebank using a variant of Another class of algorithms arising from (John- inside-outside. This step adds the latent an- son, 1998; Klein and Manning, 2003a) attempts notation to the grammar. 
to attach additional latent symbols to treebank cat- • MERGE For each pair of symbols obtained egories without focusing exclusively on lexical by a previous split, try to merge them back. head words. For instance the additional annota- If the likelihood of the treebank does not tions will try to capture non local preferences like get significantly lower (fixed threshold) then 1This short description cannot do justice to (Collins, keep the symbol merged, otherwise keep the 1999) proposal which indeed includes more fine grained in- split. formations and a backoff model. We only keep here the key aspects of his work relevant for the current discussion. 2Let us consider a dependent constituent C with head • SMOOTH This step consists in smoothing the word Chw and head tag Cht, and let C be governed by a con- probabilities of the grammar rules sharing the stituent H, with head word Hhw and head tag Hht. Gildea same left hand side. compares Collins model, where the emission of Chw is con- ditioned on Hhw, and a “mono-lexical” model, where the emission of Chw is not conditioned on Hhw. This algorithm yields state-of-the-art results on

52 English3. Its key interest is that it directly aims adapted these clues to French, following (Arun at finding an optimal language model without (1) and Keller, 2005). making additional assumptions on the annotation Finally we use as a baseline a standard PCFG scheme and (2) without relying on hand-defined algorithm, coupled with a trigram tagger (we refer heuristics. This may be viewed as a case of semi- to this setup as TNT/LNCKY algorithm4). supervised learning algorithm since the initial su- pervised learning step is augmented with a second Metrics For evaluation, we use the standard PAR- step of unsupervised learning dedicated to assign SEVAL metric of labeled precision/recall, along the latent symbols. with unlabeled dependency evaluation, which is known as a more annotation-neutral metric. Unla- 4 Experiments and Results beled dependencies are computed using the (Lin, We investigate how some treebank features impact 1995) algorithm, and the Dybro-Johansen’s head 5 learning. We describe first the experimental pro- propagation rules cited above . The unlabeled tocol, next we compare results of lexicalized and dependency F-score gives the percentage of in- unlexicalized parsers trained on various “instan- put words (excluding punctuation) that receive the tiations” of the xml source files of the FTB, and correct head. the impact of training set size for both algorithms. As usual for probabilistic parsing results, the re- Then we focus on studying how words impact the sults are given for sentences of the test set of less results of the BKYalgorithm. than 40 words (which is true for 992 sentences of the test set), and punctuation is ignored for F-score 4.1 Protocol computation with both metrics. Treebank setting For all experiments, the tree- bank is divided into 3 sections : training (80%), 4.2 Comparison using minimal tagsets development (10%) and test (10%), made of We first derive from the FTB a minimally- respectively 9881, 1235 and 1235 sentences. informed treebank, TREEBANKMIN, instantiated We systematically report the results with the from the xml source by using only the major syn- compounds merged. Namely, we preprocess the tactic categories and no other feature. In each ex- treebank in order to turn each compound into a periment (Table 1) we observe that the BKY al- single token both for training and test. gorithm significantly outperforms Collins models, for both metrics. Software and adaptation to French For the Collins algorithm, we use Bikel’s implementation parser BKY BIKEL BIKEL TNT/ metric M1 M2 LNCKY (Bikel, 2004b) (hereafter BIKEL), and we report PARSEVAL LP 85.25 78.86 80.68 68.74 results using Collins model 1 and model 2, with PARSEVAL LR 84.46 78.84 80.58 67.93 internal tagging. Adapting model 1 to French PARSEVAL F1 84.85 78.85 80.63 68.33 requires to design French specific head propaga- Unlab. dep. Prec. 90.23 85.74 87.60 79.50 tion rules. To this end, we adapted those de- Unlab. dep. Rec. 89.95 85.72 86.90 79.37 Unlab. dep. F1 90.09 85.73 87.25 79.44 scribed by (Dybro-Johansen, 2004) for extracting a Stochastic Tree Adjoining Grammar parser on Table 1: Results for parsers trained on FTB with French. And to adapt model 2, we have further minimal tagset designed French specific argument/adjunct identi- fication rules. For the BKY approach, we use the Berkeley 4The tagger is TNT (Brants, 2000), and the parser implementation, with an horizontal markovization is LNCKY, that is distributed by Mark Johnson (http://www.cog.brown.edu/∼mj/Software.htm). 
h=0, and 5 split/merge cycles. All the required Formally because of the tagger, this is not a strict PCFG knowledge is contained in the treebank used for setup. Rather, it gives a practical trade-off, in which the training, except for the treatment of unknown or tagger includes the lexical smoothing for unknown and rare words. rare words. It clusters unknown words using ty- 5For this evaluation, the gold constituent trees are con- pographical and morphological information. We verted into pseudo-gold dependency trees (that may con- tain errors). Then parsed constituent trees are converted 3(Petrov et al., 2006) obtain an F-score=90.1 for sentences into parsed dependency trees, that are matched against the of less than 40 words. pseudo-gold trees.
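As a rough illustration of this evaluation setup, the sketch below converts a constituent tree into unlabeled dependencies by percolating lexical heads, and then scores head attachments. It is a simplified reconstruction under the assumptions just stated, not the evaluation code used for the paper; in particular head_child, which picks the head daughter of each local tree, stands in for the French head-propagation rules adapted from Dybro-Johansen (2004).

# Simplified reconstruction of the constituent-to-dependency conversion used
# for unlabeled dependency evaluation; not the code used in the paper.
# `head_child(label, child_labels)` is an assumed head-rule table returning
# the index of the head daughter.

class Node:
    def __init__(self, label, children=None, index=None):
        self.label = label              # phrase label, or POS tag for a leaf
        self.children = children or []  # empty for (pre)terminals
        self.index = index              # token position for (pre)terminals

def lexical_head(node, head_child):
    """Token index of the lexical head, found by percolating heads downward."""
    if not node.children:
        return node.index
    h = head_child(node.label, [c.label for c in node.children])
    return lexical_head(node.children[h], head_child)

def dependencies(node, head_child, deps=None):
    """Unlabeled dependencies: each non-head daughter's lexical head is
    attached to the lexical head of the head daughter."""
    if deps is None:
        deps = {}
    if node.children:
        h = head_child(node.label, [c.label for c in node.children])
        head_idx = lexical_head(node.children[h], head_child)
        for i, child in enumerate(node.children):
            if i != h:
                deps[lexical_head(child, head_child)] = head_idx
            dependencies(child, head_child, deps)
    return deps

def unlabeled_dep_score(gold_tree, parsed_tree, head_child, punctuation=frozenset()):
    """Fraction of non-punctuation tokens that receive the correct head."""
    gold = dependencies(gold_tree, head_child)
    pred = dependencies(parsed_tree, head_child)
    tokens = [t for t in gold if t not in punctuation]
    correct = sum(1 for t in tokens if pred.get(t) == gold[t])
    return correct / len(tokens) if tokens else 0.0

In this simplified version the root word, which has no head, is simply left out of the comparison, and punctuation tokens are excluded, mirroring the convention stated above.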

53 4.3 Impact of training data size PCFGs (see for instance parent-transformation of How do the unlexicalized and lexicalized ap- (Johnson, 1998), or various symbol refinements in proaches perform with respect to size? We com- (Klein and Manning., 2003b)). Lexicalization it- pare in figure 3 the parsing performance BKY and self can be seen as symbol refinements (with back- COLLINSM1, on increasingly large subsets of the off though). For BKY, though the key point is to FTB, in perfect tagging mode6 and using a more automatize symbol splits, it is interesting to study detailed tagset (CC tagset, described in the next whether manual splits still help. experiment). The same 1235-sentences test set We have thus experimented BKY training with is used for all subsets, and the development set’s various tagsets. The FTB contains rich mor- size varies along with the training set’s size. BKY phological information, that can be used to split outperforms the lexicalized model even with small preterminal symbols : main coarse category (there amount of data (around 3000 training sentences). are 13), subcategory (subcat feature refining the Further, the parsing improvement that would re- main cat), and inflectional information (mph fea- sult from more training data seems higher for BKY ture). than for Bikel. We report in Table 2 results for the four tagsets, where terminals are made of : MIN: main cat, SUBCAT: main cat + subcat feature, MAX: cat + subcat + all inflectional information, CC: cat + ver- bal mood + wh feature.
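In outline, these four instantiations of the preterminals can be pictured with the following sketch before turning to the table. It is an illustrative reconstruction only: the attribute names cat, subcat, mph, mood and wh are hypothetical stand-ins for the treebank's morphological features, not the actual field names of the xml source.

# Hypothetical sketch of the four preterminal instantiations built from the
# FTB morphological annotation; attribute names are illustrative placeholders.

def make_preterminal(token, tagset):
    cat = token["cat"]
    if tagset == "MIN":                         # main category only
        return cat
    if tagset == "SUBCAT":                      # category + subcategory
        return cat + token.get("subcat", "")
    if tagset == "MAX":                         # category + subcat + inflection
        return cat + token.get("subcat", "") + token.get("mph", "")
    if tagset == "CC":                          # category + verbal mood + wh
        return cat + token.get("mood", "") + token.get("wh", "")
    raise ValueError("unknown tagset: " + tagset)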

Tagset    Nb of tags   Parseval F1   Unlab. dep. F1   Tagging Acc
MIN               13         84.85            90.09         97.35
SUBCAT            34         85.74                –         96.63
MAX              250         84.13                –         92.20
CC                28         86.41            90.99         96.83

Table 2: Tagset impact on learning with BKY (own tagging)

[Figure 3 near here: learning curves comparing the Bikel and Berkeley (BKY) parsers as the number of training sentences grows; the caption appears below.]

The corpus instantiation with CC tagset is our

Number of training sentences best trade-off between tagset informativeness and obtained parsing performance7. It is also the best result obtained for French probabilistic parsing. Figure 3: Parsing Learning curve on FTB with CC- This demonstrates though that the BKY learning tagset, in perfect-tagging is not optimal since manual a priori symbol refine- ments significantly impact the results. This potential increase for BKY results if we We also tried to learn structures with functional had more French annotated data is somehow con- annotation attached to the labels : we obtain PAR- firmed by the higher results reported for BKY SEVAL F1=78.73 with tags from the CC tagset + training on the Penn Treebank (Petrov et al., 2006) grammatical function. This degradation, due to :F1=90.2. We can show though that the 4 points data sparseness and/or non local constraints badly increase when training on English data is not only captured by the model, currently constrains us to due to size : we extracted from the Penn Treebank use a language model without functional informa- a subset comparable to the FTB, with respect to tions. As stressed in the introduction, this limits number of tokens and average length of sentences. the interpretability of the parses and it is a trade- We obtain F1=88.61 with BKY training. off between generalization and interpretability. 4.4 Symbol refinements 4.5 Lexicon and Inflection impact It is well-known that certain treebank transfor- mations involving symbol refinements improve French has a rich morphology that allows some degree of word order variation, with respect to 6For BKY, we simulate perfect tagging by changing words into word+tag in training, dev and test sets. We ob- 7The differences are statistically significant : using a stan- tain around 99.8 tagging accuracy, errors are due to unknown dard t-test, we obtain p-value=0.015 between MIN and SUB- words. CAT, and p-value=0.002 between CC and SUBCAT.

54 English. For probabilistic parsing, this can have size is superior to approx. 3000 sentences8. contradictory effects : (i) on the one hand, this This suggests that only very frequent words induces more data sparseness : the occurrences matter, otherwise words’ impact should be more of a French regular verb are potentially split into and more important as training material augments. more than 60 forms, versus 5 for an English verb; (ii) on the other hand, inflection encodes agreements, that can serve as clues for syntactic attachments.

Experiment  In order to measure the impact of inflection, we have tested clustering word forms on a morphological basis, namely partly cancelling inflection. Using lemmas as word form classes seems too coarse: it would not allow us to distinguish, for instance, between a finite verb and a participle, though they exhibit different distributional properties. Instead we use as word form classes the couple lemma + syntactic category.

[Figure 4 near here: Parseval F-score as a function of the number of training sentences, for BKY with terminal=form+tag, terminal=lemma+tag and terminal=tag; the caption appears below.]

For example for verbs, given the CC tagset, this Number of training sentences amounts to keeping 6 different forms (for the 6 moods). Figure 4: Impact of clustering word forms (train- To test this grouping, we derive a treebank where ing on FTB with CC-tagset, in perfect-tagging) words are replaced by the concatenation of lemma + category for training and testing the parser. Since it entails a perfect tagging, it has to be 5 Related Work compared to results in perfect tagging mode : more precisely, we simulate perfect tagging Previous works on French probabilistic parsing are by replacing word forms by the concatenation those of (Arun and Keller, 2005), (Schluter and form+tag. van Genabith, 2007), (Schluter and van Genabith, Moreover, it is tempting to study the impact of 2008). One major difficulty for comparison is that a more drastic clustering of word forms : that of all three works use a different version of the train- using the sole syntactic category to group word ing corpus. Arun reports results on probabilistic forms (we replace each word by its tag). This parsing, using an older version of the FTB and us- amounts to test a pure unlexicalized learning. ing lexicalized models (Collins M1 and M2 mod- els, and the bigram model). It is difficult to com- pare our results with Arun’s work, since the tree- Discussion Results are shown in Figure 4. bank he has used is obsolete (FTB-V0). He obtains We make three observations : First, comparing for Model 1 : LR=80.35 / LP=79.99, and for the the terminal=tag curves with the other two, it bigram model : LR=81.15 / LP=80.84, with min- appears that the parser does take advantage of imal tagset and internal tagging. The results with lexical information to rank parses, even for this FTB (revised subset of FTB-V0) with minimal “unlexicalized” algorithm. Yet the relatively small increase clearly shows that lexical information 8 This is true for all points in the curves, except for remains underused, probably because of lexical the last step, i.e. when full training set is used. We per- formed a 10-fold cross validation to limit sample effects. For data sparseness. the BKYtraining with CC tagset, and own tagging, we ob- Further, comparing terminal=lemma+tag and ter- tain an average F-score of 85.44 (with a rather high stan- dard deviation σ=1.14). For the clustering word forms ex- minal=form+tag curves, we observe that grouping periment, using the full training set, we obtain : 86.64 for words into lemmas helps reducing this sparseness. terminal=form+tag (σ=1.15), 87.33 for terminal=lemma+tag And third, the lexicon impact evolution (i.e. (σ=0.43), and 85.72 for terminal=tag (σ=0.43). Hence our conclusions (words help even with unlexicalized algorithm, the increment between terminal=tag and termi- and further grouping words into lemmas helps) hold indepen- nal=form+tag curves) is stable, once the training dently of sampling.
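The word-form clustering used in this experiment can be illustrated with the following sketch. It is not the authors' preprocessing code: the token attributes and the tag string "Vpart" are hypothetical placeholders, standing in for the CC-tagset tags that distinguish verbal moods.

# Illustrative sketch of the word-form clustering, not the authors' scripts.
# Token attributes and the tag string "Vpart" are hypothetical placeholders.

def cluster_terminal(token, mode):
    """Replace a terminal by form+tag, lemma+tag, or tag alone."""
    if mode == "form+tag":       # simulates perfect tagging, keeps the full lexicon
        return token["form"] + "_" + token["tag"]
    if mode == "lemma+tag":      # collapses inflected forms of the same lemma
        return token["lemma"] + "_" + token["tag"]
    if mode == "tag":            # purely unlexicalized terminals
        return token["tag"]
    raise ValueError("unknown clustering mode: " + mode)

def transform_sentence(tokens, mode):
    return [cluster_terminal(t, mode) for t in tokens]

# Two participle forms of "envoyer" collapse under lemma+tag but not form+tag.
sentence = [{"form": "envoyée", "lemma": "envoyer", "tag": "Vpart"},
            {"form": "envoyés", "lemma": "envoyer", "tag": "Vpart"}]
print(transform_sentence(sentence, "lemma+tag"))  # ['envoyer_Vpart', 'envoyer_Vpart']
print(transform_sentence(sentence, "form+tag"))   # ['envoyée_Vpart', 'envoyés_Vpart']

Applying the same transformation to both training and test material is what makes the lemma+tag and tag settings comparable to the perfect-tagging form+tag setting discussed above.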

55 tagset (Table 1) are comparable for COLLINSM1, table 2), it is indeed very high if training data size and nearly 5 points higher for BKY. is taken into account (cf. the BKY learning curve in figure 3). This good result raises the open ques- It is also interesting to review (Arun and Keller, tion of identifying which modifications in the MFT 2005) conclusion, built on a comparison with the (error mining and correction, tree transformation, German situation : at that time lexicalization was symbol refinements) have the major impact. thought (Dubey and Keller, 2003) to have no siz- able improvement on German parsing, trained on 6 Conclusion the Negra treebank, that uses a flat structures. So (Arun and Keller, 2005) conclude that since lex- This paper reports results in statistical parsing icalization helps much more for parsing French, for French with both unlexicalized (Petrov et al., with a flat annotation, then word-order flexibility 2006) and lexicalized parsers. To our knowledge, is the key-factor that makes lexicalization useful both results are state of the art on French for each (if word order is fixed, cf. French and English) paradigm. and useless (if word order is flexible, cf. German). Both algorithms try to overcome PCFG’s sim- This conclusion does not hold today. First, it can plifying assumptions by some specialization of the be noted that as far as word order flexibility is con- grammatical labels. For the lexicalized approach, cerned, French stands in between English and Ger- the annotation of symbols with lexical head is man. Second, it has been proven that lexicalization known to be rarely fully used in practice (Gildea, helps German probabilistic parsing (Kübler et al., 2001), what is really used being the category of 2006). Finally, these authors show that markoviza- the lexical head. tion of the unlexicalized Stanford parser gives al- We observe that the second approach (BKY) most the same increase in performance than lex- constantly outperforms the lexicalist strategy à la icalization, both for the Negra treebank and the (Collins, 1999). We observe however that (Petrov Tüba-D/Z treebank. This conclusion is reinforced et al., 2006)’s semi-supervised learning procedure by the results we have obtained : the unlexicalized, is not fully optimal since a manual refinement of markovized, PCFG-LA algorithm outperforms the the treebank labelling turns out to improve the Collins’ lexicalized model. parsing results. Finally we observe that the semi-supervised (Schluter and van Genabith, 2007) aim at learn- BKY algorithm does take advantage of lexical in- ing LFG structures for French. To do so, and in formation : removing words degrades results. The order to learn first a Collins parser, N. Schluter preterminal symbol splits percolates lexical dis- created a modified treebank, the MFT, in order (i) tinctions. Further, grouping words into lemmas to fit her underlying theoretical requirements, (ii) helps for a morphologically rich language such as to increase the treebank coherence by error min- French. So, an intermediate clustering standing ing and (iii) to improve the performance of the between syntactic category and lemma is thought learnt parser. The MFT contains 4739 sentences to yield better results in the future. taken from the FTB, with semi-automatic trans- formations. These include increased rule stratifi- 7 Acknowledgments cation, symbol refinements (for information prop- agation), coordination raising with some manual We thank N. Schluter and J. 
van Genabith for re-annotation, and the addition of functional tags. kindly letting us run BKY on the MFT, and A. MFT has also undergone a phase of error min- Arun for answering our questions. We also thank ing, using the (Dickinson and Meurers, 2005) soft- the reviewers for valuable comments and refer- ware, and following manual correction. She re- ences. The work of the second author was partly ports a 79.95% F-score on a 400 sentence test funded by the “Prix Diderot Innovation 2007”, set, which compares almost equally with Arun’s from University Paris 7. results on the original 20000 sentence treebank. So she attributes her results to the increased co- herence of her smaller treebank. Indeed, we ran the BKY training on the MFT, and we get F- score=84.31. While this is less in absolute than the BKY results obtained with FTB (cf. results in

56 References Mark Johnson. 1998. PCFG models of linguis- tic tree representations. Computational Linguistics, Anne Abeillé, Lionel Clément, and Alexandra Kinyon. 24(4):613–632. 2000. Building a treebank for french. In Proceed- ings of the 2nd International Conference Language Dan Klein and Christopher D. Manning. 2003a. Ac- Resources and Evaluation (LREC’00). curate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computa- Anne Abeillé, Lionel Clément, and François Toussenel, tional Linguistics-Volume 1, pages 423–430. Asso- 2003. Building a treebank for French. Kluwer, Dor- ciation for Computational Linguistics Morristown, drecht. NJ, USA. Abhishek Arun and Frank Keller. 2005. Lexicalization Dan Klein and Christopher D. Manning. 2003b. Ac- in crosslinguistic probabilistic parsing: The case of curate unlexicalized parsing. In Proceedings of the french. In Proceedings of the 43rd Annual Meet- 41st Meeting of the Association for Computational ing of the Association for Computational Linguis- Linguistics. tics, pages 306–313, Ann Arbor, MI. Sandra Kübler, Erhard W. Hinrichs, and Wolfgang Daniel M. Bikel. 2004a. A distributional analysis Maier. 2006. Is it really that difficult to parse ger- of a lexicalized statistical parsing model. In Proc. man? In Proceedings of the 2006 Conference on of Empirical Methods in Natural Language Pro- Empirical Methods in Natural Language Process- cessing (EMNLP 2004), volume 4, pages 182–189, ing, pages 111–119, Sydney, Australia, July. Asso- Barcelona, Spain. ciation for Computational Linguistics. Daniel M. Bikel. 2004b. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4):479–511. Dekang Lin. 1995. A dependency-based method for evaluating broad-coverage parsers. In International Thorsten Brants. 2000. Tnt – a statistical part-of- Joint Conference on Artificial Intelligence, pages speech tagger. In Proceedings of the 6th Applied 1420–1425, Montreal. NLP Conference (ANLP), Seattle-WA. Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. Eugene Charniak. 2000. A maximum-entropy- 2005. Probabilistic cfg with latent annotations. In inspired parser. In Proceedings of the Annual Meet- Proceedings of the 43rd Annual Meeting of the Asso- ing of the North American Association for Com- ciation for Computational Linguistics (ACL), pages putational Linguistics (NAACL-00), pages 132–139, 75–82. Seattle, Washington. David McClosky, Eugene Charniak, and Mark John- Michael Collins. 1999. Head driven statistical models son. 2006. Effective self-training for parsing. In for natural language parsing. Ph.D. thesis, Univer- Proceedings of the Human Language Technology sity of Pennsylvania, Philadelphia. Conference of the NAACL, Main Conference, pages 152–159, New York City, USA, June. Association Benoit Crabbé and Marie Candito. 2008. Expériences for Computational Linguistics. d’analyse syntaxique statistique du français. In Actes de la 15ème Conférence sur le Traitement Au- Slav Petrov, Leon Barrett, Romain Thibaux, and Dan tomatique des Langues Naturelles (TALN’08), pages Klein. 2006. Learning accurate, compact, and 45–54, Avignon. interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Markus Dickinson and W. Detmar Meurers. 2005. Linguistics and 44th Annual Meeting of the Associ- Prune diseased branches to get healthy trees! how ation for Computational Linguistics, Sydney, Aus- to find erroneous local trees in treebank and why tralia, July. 
Association for Computational Linguis- it matters. In Proceedings of the 4th Workshop tics. on Treebanks and Linguistic Theories (TLT 2005), Barcelona, Spain. Natalie Schluter and Josef van Genabith. 2007. Preparing, restructuring, and augmenting a french Amit Dubey and Frank Keller. 2003. Probabilis- treebank: Lexicalised parsers or coherent treebanks? tic parsing for german using sister-head dependen- In Proceedings of PACLING 07. cies. In In Proceedings of the 41st Annual Meet- ing of the Association for Computational Linguis- Natalie Schluter and Josef van Genabith. 2008. tics, pages 96–103. Treebank-based acquisition of lfg parsing resources for french. In European Language Resources As- Ane Dybro-Johansen. 2004. Extraction automatique sociation (ELRA), editor, Proceedings of the Sixth de grammaires á partir d’un corpus français. Mas- International Language Resources and Evaluation ter’s thesis, Université Paris 7. (LREC’08), Marrakech, Morocco, may. Daniel Gildea. 2001. Corpus variation and parser per- formance. In Proceedings of the First Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 167–202.

Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars

Franco M. Luque and Gabriel Infante-Lopez Grupo de Procesamiento de Lenguaje Natural Universidad Nacional de Córdoba & CONICET Argentina {francolq|gabriel}@famaf.unc.edu.ar

Abstract underlying language. Moreover, UNTS grammars have been successfully used to induce grammars Unambiguous Non-Terminally Separated from unannotated corpora in competitions of (UNTS) grammars have properties that learnability of formal languages (Clark, 2007). make them attractive for grammatical in- UNTS grammars can be used for modeling nat- ference. However, these properties do not ural language. They can be induced using any state the maximal performance they can training material, the induced models can be eval- achieve when they are evaluated against a uated using trees from a treebank, and their per- gold treebank that is not produced by an formance can be compared against state-of-the- UNTS grammar. In this paper we inves- art unsupervised models. Different learning al- tigate such an upper bound. We develop gorithms might produce different grammars and, a method to find an upper bound for the consequently, different scores. The fact that the unlabeled F 1 performance that any UNTS class of UNTS grammars is PAC learnable does grammar can achieve over a given tree- not convey any information on the possible scores bank. Our strategy is to characterize all that different UNTS grammars might produce. possible versions of the gold treebank that From a performance oriented perspective it might UNTS grammars can produce and to find be possible to have an upper bound over the set the one that optimizes a metric we define. of possible scores of UNTS grammars. Knowing We show a way to translate this score into an upper bound is complementary to knowing that an upper bound for the F 1. In particular, the class of UNTS grammars is PAC learnable. we show that the F 1 parsing score of any Such upper bound has to be defined specifically UNTS grammar can not be beyond 82.2% for UNTS grammars and has to take into account when the gold treebank is the WSJ10 cor- the treebank used as test set. The key question pus. is how to compute it. Suppose that we want to evaluate the performance of a given UNTS gram- 1 Introduction mar using a treebank. The candidate grammar pro- Unsupervised learning of natural language has re- duces a tree for each sentence and those trees are ceived a lot of attention in the last years, e.g., Klein compared to the original treebank. We can think and Manning (2004), Bod (2006a) and Seginer that the candidate grammar has produced a new (2007). Most of them use sentences from a tree- version of the treebank, and that the score of the bank for training and trees from the same treebank grammar is a measure of the closeness of the new for evaluation. As such, the best model for un- treebank to the original treebank. Finding the best supervised parsing is the one that reports the best upper bound is equivalent to finding the closest performance. UNTS version of the treebank to the original one. Unambiguous Non-Terminally Separated Such bounds are difficult to find for most classes (UNTS) grammars have properties that make of languages because the search space is the them attractive for grammatical inference. These set of all possible versions of the treebank that grammars have been shown to be PAC-learnable might have been produced by any grammar in the in polynomial time (Clark, 2006), meaning that class under study. In order to make the problem under certain circumstances, the underlying tractable, we need the formalism to have an easy grammar can be learned from a sample of the way to characterize all the versions of a treebank


58 ∗ it might produce. UNTS grammars have a special have that X ⇒ αY γ (Clark, 2007). Unambiguous characterization that makes the search space easy NTS (UNTS) grammars are those NTS grammars to define but whose exploration is NP-hard. that parses unambiguously every instance of the In this paper we present a way to characterize language. UNTS grammars and a metric function to mea- Given any grammar G, a substring s of r ∈ sure the closeness between two different version L(G) is called a constituent of r if and only if there ∗ ∗ of a treebank. We show that the problem of find- is an X in N such that S ⇒ uXv ⇒ usv = r. ing the closest UNTS version of the treebank can In contrast, a string s is called a non-constituent or be described as Maximum Weight Independent Set distituent of r ∈ L(G) if s is not a constituent of r. (MWIS) problem, a well known NP-hard problem We say that s is a constituent of a language L(G) (Karp, 1972). The exploration algorithm returns if for every r that contains s, s is a constituent of a version of the treebank that is the closest to the r. In contrast, s is a distituent of L(G) if for every gold standard in terms of our own metric. r where s occurs, s is a distituent of r. We show that the F 1-measure is related to our An interesting characterization of finite UNTS measure and that it is possible to find and upper grammars is that every substring that appear in bound of the F 1-performance for all UNTS gram- some string of the language is always a constituent mars. Moreover, we compute this upper bound for or always a distituent. In other words, if there is a the WSJ10, a subset of the Penn Treebank (Mar- string r in L(G) for which s is a constituent, then cus et al., 1994) using POS tags as the alphabet. s is a constituent of L(G). By means of this prop- The upper bound we found is 82.2% for the F 1 erty, if we ignore the non-terminal labels, a finite measure. Our result suggest that UNTS grammars UNTS language is fully determined by its set of are a formalism that has the potential to achieve constituents C. We can show this property for fi- state-of-the-art unsupervised parsing performance nite UNTS languages. We believe that it can also but does not guarantee that there exists a grammar be shown for non-finite cases, but for our purposes that can actually achieve the 82.2%. the finite cases suffices, because we use grammars To the best of our knowledge, there is no pre- to parse finite sets of sentences, specifically, the vious research on finding upper bounds for perfor- sentences of test treebanks. We know that for ev- mance over a concrete class of grammars. In Klein ery finite subset of an infinite language produced and Manning (2004), the authors compute an up- by a UNTS grammar G, there is a UNTS gram- per bound for parsing with binary trees a gold tree- mar G′ whose language is finite and that parses bank that is not binary. This upper bound, that is the finite subset as G. If we look for the upper 88.1% for the WSJ10, is for any parser that returns bound among the grammars that produce a finite binary trees, including the concrete models devel- language, this upper bound is also an upper bound oped in the same work. But their upper bound does for the class of infinite UNTS grammars. not use any specific information of the concrete models that may help them to find better ones. The UNTS characterization plays a very im- The rest of the paper is organized as follows. 
portant role in the way we look for the upper Section 2 presents our characterization of UNTS bound. Our method focuses on how to determine grammars. Section 3 introduces the metric we op- which of the constituents that appear in the gold timized and explains how the closest version of the are actually the constituents that produce the up- treebank is found. Section 4 explains how the up- per bound. Suppose that a given gold treebank per bound for our metric is translated to an up- contains two strings α and β such that they occur per bound of the F 1 score. Section 5 presents our overlapped. That is, there exist non-empty strings α′,γ,β′ such that α = α′γ and β = γβ′ and bound for UNTS grammars using the WSJ10 and ′ ′ finally Section 6 concludes the paper. α γβ occurs in the treebank. If C is the set of constituents of a UNTS grammar it can not have 2 UNTS Grammars and Languages both α and β. It might have one or the other, but if both belong to C the resulting language can not Formally, a context free grammar G = be UNTS. In order to find the closest UNTS gram- (Σ,N,S,P ) is said to be Non-Terminally Sepa- mar we design a procedure that looks for the sub- rated (NTS) if, for all X,Y ∈ N and α,β,γ ∈ set of all substrings that occur in the sentences of ∗ ∗ (Σ ∪ N)∗ such that X ⇒ αβγ and Y ⇒ β, we the gold treebank that can be the constituent set C

59 of a grammar. We do not explicitly build a UNTS ric, and we show that the possible values of F 1 grammar, but find the set C that produces the best can be bounded by a function that takes this score score. as argument. In this section we present our metric We say that two strings α and β are compatible and the technique we use to find a grammar that in a language L if they do not occur overlapped reports the best value for our metric. in L, and hence they both can be members of C. If the original treebank T is not produced by any If we think of L as a subset of an infinite lan- UNTS grammar, then there are strings in T that guage, it is not possible to check that two overlap- are constituents in some sentences and that are dis- ping strings do not appear overlapped in the “real” tituents in some other sentences. For each one of language and hence that they are actually com- them we need a procedure to decide whether they patible. Nevertheless, we can guarantee compat- are members of C or not. If a string α appears a ibility between two strings α,β by requiring that significant number of times more as a constituent they do not overlap at all, this is, that there are than as a distituent the procedure may choose to ′ ′ ′ no non-empty strings α ,γ,β such that α = α γ include it in C at the price of being wrong a few ′ and β = γβ . We call this type of compatibility times. That is, the new version of T has all occur- strong compatibility. Strong compatibility ensures rences of α either as constituents or as distituents. that two strings can belong to C regardless of L. The treebank that has all of its occurrences as con- In our experiments we focus on finding the best set stituents differs from the original in that there are C of compatible strings. some occurrences of α that were originally dis- Any set of compatible strings C extracted from tituents and are marked as constituents. Similarly, the gold treebank can be used to produce a new if α is marked as distituent in the new treebank, it version of the treebank. For example, Figure 1 has occurrences of α that were constituents in T . shows two trees from the WSJ Penn Treebank. The decision procedure becomes harder when The string “in the dark” occurs as a constituent in all the substrings that appear in the treebank are (a) and as a distituent in (b). If C contains “in the considered. The increase in complexity is a con- dark”, it can not contain “the dark clouds” given sequence of the number of decisions the procedure that they overlap in the yield of (b). As a con- needs to take and the way these decisions interfere sequence, the new treebank correctly contains the one with another. We show that the problem of subtree in (a) but not the one in (b). Instead, the determining the set C is naturally embedded in a yield of (b) is described as in (c) in the new tree- graph NP-hard problem. We define a way to look bank. for the optimal grammars by translating our prob- C defines a new version of the treebank that sat- lem to a well known graph problem. Let L be the isfies the UNTS property. Our goal is to obtain a ′ ′ the set of sentences in a treebank, and let S(L) be treebank T such that (a) T and T are treebanks all the possible non-empty proper substrings of L. over the same set of sentences, (b) T ′ is UNTS, ′ We build a weighted undirected graph G in terms and (c) T is the closest treebank to T in terms of of the treebank as follows. Nodes in G correspond performance. The three of them imply that any to strings in S(L). 
The weight of a node is a func- other UNTS grammar is not as similar as the one tion w(s) that models our interest of having s se- we found. lected as a constituent; w(s) is defined in terms of some information derived from the gold treebank 3 Finding the Best UNTS Grammar T and we discuss it later in this section. Finally, As our goal is to find the closest grammar in terms two nodes a and b are connected by an edge if their of performance, we need to define first a weight two corresponding strings conflict in a sentence of for each possible grammar and second, an algo- T (i.e., they are not compatible in L). rithm that searches for the grammar with the best Not all elements of L are in S(L). We did not weight. Ideally, the weight of a candidate gram- include L in S(L) for two practical reasons. The mar should be in terms of F 1, but we can show first one is that to require L in S(L) is too re- that optimization of this particular metric is com- strictive. It states that all strings in L are in fact putationally hard. Instead of defining F 1 as their constituents. If two string ab and bc of L oc- score, we introduce a new metric that is easier to cur overlapped in a third string abc then there is optimize, we find the best grammar for this met- no UNTS grammar capable of having the three of

60 (a) (b) (c) IN NNS PRP VBP IN DT JJ IN in DT JJ NNS clouds we DT JJ ’re in the dark in the dark clouds the dark Figure 1: (a) and (b) are two subtrees that show “in the dark” as a constituent and as a distituent respec- tively. (c) shows the result of choosing “in the dark” as a constituent. them as constituents. The second one is that in- imize c(s) and minimize d(s) at the same time. cluding them produces graphs that are too sparse. This can be done by defining the contribution of a If they are included in the graph, we know that string s to the overall score as any solution should contain them, consequently, all their neighbors do not belong to any solution w(s)= c(s) − d(s). and they can be removed from the graph. Our ex- With this definition of w, the weight W (C) = periments show that the graph that results from re- w(s) becomes the number of constituents moving nodes related to nodes representing strings s∈C of T ′ that are in T minus the number of con- in L are too small to produce any interesting result. P stituents that do not. If we define the number of By means of representing the treebank as a hits to be H(C) = c(s) and the number of graph, selecting a set of constituents C ⊆ S(L) s∈C misses to be M(C)=P ∈ d(s) we have that is equivalent to selecting an independent set of s C P nodes in the graph. An independent set is a sub- W (C)= H(C) − M(C). (1) set of the set of nodes that do not have any pair of nodes connected by an edge. Clearly, there are As we confirm in Section 5, graphs tend to be exponentially many possible ways to select an in- very big. In order to reduce the size of the graphs, dependent set, and each of these sets represents a if a string s has w(s) ≤ 0, we do not include its set of constituents. But, since we are interested in corresponding node in the graph. An independent the best set of constituents, we associate to each set that does not include s has an equal or higher independent set C the weight W (C) defined as W than the same set including s. s∈C w(s). Our aim is then to find a set Cmax For example, let T be the treebank in Fig- thatP maximizes this weight. This problem is a well ure 2 (a). The sets of substrings such that known problem of graph theory known in the lit- w(c) ≥ 0 is {da,cd,bc,cda,ab,bch}. The erature as the Maximum Weight Independent Set graph that corresponds to this set of strings is (MWIS) problem. This problem is also known to given in Figure 3. Nodes corresponding to be NP-hard (Karp, 1972). strings {dabch,bcda,abe,abf,abg,bci,daj} are We still have to choose a definition for w(s). not shown in the figure because the strings do We want to find the grammar that maximizes F 1. not belong to S(L). The figure also shows the Unfortunately, F 1 can not be expressed in terms of weights associated to the substrings according to a sum of weights. Maximization of F 1 is beyond their counts in Figure 2 (a). The shadowed nodes the expressiveness of our model, but our strategy correspond to the independent set that maximizes is to define a measure that correlates with F 1 and W . The trees in the Figure 2 (b) are the sentences that can be expressed as a sum of weights. of the treebank parsed according the optimal inde- In order to introduce our measure, we first de- pendent set. fine c s and d s as the number of times a string ( ) ( ) 4 An Upper Bound for F1 s appears in the gold treebank T as a constituent and as a distituent respectively. 
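As a toy illustration of these counts (not part of the paper), c(s) and d(s) can be read off a small bracketed treebank as sketched below, representing trees as nested tuples of POS tags; w(s) = c(s) − d(s) then gives the weight used in the search.

    from collections import Counter

    def leaves(tree):
        """Yield of a tree given as nested tuples of POS tags."""
        return (tree,) if isinstance(tree, str) else sum((leaves(t) for t in tree), ())

    def spans(tree, start=0):
        """Return (end position, list of (i, j) spans of the internal nodes)."""
        if isinstance(tree, str):
            return start + 1, []
        pos, out = start, []
        for sub in tree:
            pos, inner = spans(sub, pos)
            out += inner
        return pos, out + [(start, pos)]

    def counts(treebank):
        """c(s) and d(s): occurrences of each substring as constituent / distituent."""
        c, d = Counter(), Counter()
        for tree in treebank:
            sent = leaves(tree)
            _, constituent_spans = spans(tree)
            for i in range(len(sent)):
                for j in range(i + 2, len(sent) + 1):   # substrings of length >= 2
                    if i == 0 and j == len(sent):
                        continue                        # proper substrings only, as in S(L)
                    s = sent[i:j]
                    if (i, j) in constituent_spans:
                        c[s] += 1
                    else:
                        d[s] += 1
        return c, d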
Observe that if Even though finding the independent set that max- we choose to include s as a constituent of C, the imizes W is an NP-Hard problem, there are in- resulting treebank T ′ contains all the c(s)+ d(s) stances where it can be effectively computed, as occurrences of s as a constituent. c(s) of the s oc- we show in the next section. The set Cmax max- currences in T ′ are constituents as they are in T imizes W for the WSJ10 and we know that all and d(s) of the occurrences are constituents in T ′ others C produces a lower value of W . In other but are in fact distituents in T . We want to max- words, the set Cmax produce a treebank Tmax that

Figure 2: (a) A gold treebank: (ab)e, (ab)f, (ab)g, (bc)i, (da)j, (da)((bc)h), b((cd)a). (b) The treebank generated by the grammar C = L ∪ {cd, ab, cda}: (ab)e, (ab)f, (ab)g, bci, daj, d(ab)ch, b((cd)a).
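For toy treebanks such as the one in Figure 2, the maximum-weight independent set can be found by brute force; the sketch below is illustrative only (it does not scale), and it builds the conflict edges with the strong-compatibility test described above rather than with per-sentence conflicts.

    from itertools import combinations

    def overlap(a, b):
        """Strong incompatibility: a non-empty proper suffix of one string is a prefix of the other."""
        def suffix_prefix(x, y):
            return any(x[-k:] == y[:k] for k in range(1, min(len(x), len(y))))
        return suffix_prefix(a, b) or suffix_prefix(b, a)

    def best_independent_set(weights, edges):
        """Exhaustive MWIS search; weights maps strings to w(s) = c(s) - d(s)."""
        nodes = list(weights)
        best, best_w = set(), 0
        for r in range(1, len(nodes) + 1):
            for cand in map(set, combinations(nodes, r)):
                if any(a in cand and b in cand for a, b in edges):
                    continue                      # not an independent set
                w = sum(weights[s] for s in cand)
                if w > best_w:
                    best, best_w = cand, w
        return best, best_w

    # edges = [(a, b) for a, b in combinations(weights, 2) if overlap(a, b)]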

F 1-weights for all possible UNTS grammars re- spectively. Then, if w is an upper bound of X, then f(w) is an upper bound of Y . The function f is defined as follows: 1 f(w)= F 1 w , 1 (2) 2 − K  Figure 3: Graph for the treebank of Figure 2. 2pr where F 1(p,r) = p+r , and K = s∈ST c(s) is the total number of constituents inP the gold tree- bank T . From it, we can also derive values for is the closest UNTS version to the WSJ10 in terms 1 precision and recall: precision − w and recall 1. of W . We can compute the precision, recall and 2 K A recall of 1 is clearly an upper bound for all the F 1 for Cmax but there is no warranty that the F 1 possible values of recall, but the value given for score is the best for all the UNTS grammars. This precision is not necessarily an upper bound for all is the case because F 1 and W do not define the the possible values of precision. It might exist a same ordering over the family of candidate con- grammar having a higher value of precision but stituent sets C: there are gold treebanks T (used whose F 1 has to be below our upper bound. for computing the metrics), and sets C1, C2 such The rest of section shows that f(W ) is an up- that F 1(C ) < F 1(C ) and W (C ) > W (C ). 1 2 1 2 per bound for F , the reader not interested in the For example, consider the gold treebank T in Fig- 1 technicalities can skip it. ure 4 (a). The table in Figure 4 (b) displays two The key insight for the proof is that both metrics sets C1 and C2, the treebanks they produce, and F 1 and W can be written in terms of precision and their values of F 1 and W . Note that C is the re- 2 recall. Let T be the treebank that is used to com- sult of adding the string ef to C , also note that 1 pute all the metrics. And let T ′ be the treebank c(ef) = 1 and d(ef) = 2. This improves the F 1 produced by a given constituent set C. If a string score but produces a lower W . s belongs to C, then its c(s)+ d(s) occurrences The F 1 measure we work with is the one de- ′ in T are marked as constituents. Moreover, s is fined in the recent literature of unsupervised pars- correctly tagged a c(s) number of times while it ing (Klein and Manning, 2004). F 1 is defined in is incorrectly tagged a d(s) number of times. Us- terms of Precision and Recall as usual, and the last ing this, P , R and F 1 can be computed for C as two measures are micro-averaged measures that follows: include full-span brackets, and that ignore both P c(s) unary branches and brackets of span one. For sim- P (C)= s∈C Ps∈C c(s)+d(s) plicity, the previous example does not count the H(C) (3) full-span brackets. = H(C)+M(C) As the example shows, the upper bound for W Ps∈C c(s) R(C)= K might not be an upper bound of F 1, but it is pos- = H(C) (4) sible to find a way to define an upper bound of K 2P (C)R(C) F 1 using the upper bound of W . In this section F 1(C)= P (C)+R(C) we define a function f with the following prop- = 2H(C) erty. Let X and Y be the sets of W -weights and K+H(C)+M(C)
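Restated in code (a direct transcription of equations (1), (3) and (4) and of the translation function in equation (2), under the same notation):

    def scores(H, M, K):
        """W, P, R and F1 in terms of hits H, misses M and gold constituents K."""
        W = H - M                      # equation (1)
        P = H / (H + M)                # equation (3)
        R = H / K                      # equation (4)
        F1 = 2 * H / (K + H + M)
        return W, P, R, F1

    def f_bound(w, K):
        """Equation (2): the F1 upper bound obtained from an upper bound w on W."""
        p = 1 / (2 - w / K)            # precision paired with recall 1
        return 2 * p / (p + 1)         # F1(p, 1)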

Figure 4: (a) A gold treebank: (ab)c, a(bd), (ef)g, efh, efi. (b) Two grammars, the treebanks they generate, and their scores:

    C                                        TC                                    P    R    F1   W
    C1 = {abc, abd, efg, efh, efi, ab}       {(ab)c, (ab)d, efg, efh, efi}         50%  33%  40%  1 − 1 = 0
    C2 = {abc, abd, efg, efh, efi, ab, ef}   {(ab)c, (ab)d, (ef)g, (ef)h, (ef)i}   40%  67%  50%  2 − 3 = −1
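The mismatch between the two orderings can be checked with a few lines of arithmetic on the table above (K = 3, since the gold trees contribute the constituents ab, bd and ef once full-span brackets are ignored, as in the example):

    K = 3
    for name, H, M in [("C1", 1, 1), ("C2", 2, 3)]:
        W, P, R = H - M, H / (H + M), H / K
        F1 = 2 * H / (K + H + M)
        print(name, W, round(P, 2), round(R, 2), round(F1, 2))
    # C1 -> W =  0, P = 0.50, R = 0.33, F1 = 0.40
    # C2 -> W = -1, P = 0.40, R = 0.67, F1 = 0.50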

W can also be written in terms of P and R as fixed weight W = wC . The function is monoton- 1 ically increasing in r, so we can apply it to both W (C) = (2 − )R(C)K (5) sides of the following inequality r ≤ 1, which is P (C) C trivially true. As result, we get f1C ≤ f(wC ) as This formula is proved to be equivalent to Equa- required. The second inequality is proved by ob- tion (1) by replacing P (C) and R(C) with equa- serving that f(w) is monotonically increasing in tions (3) and (4) respectively. Using the last two w, and by applying it to both sides of the hypothe- equations, we can rewrite F 1 and W taking p and sis wc ≤ w. r, representing values of precision and recall, as parameters: 5 UNTS Bounds for the WSJ10 Treebank 2pr In this section we focus on trying to find real upper F 1(p,r)= p + r bounds building the graph for a particular treebank 1 T . We find the best independent set, we build the W (p,r) = (2 − )rK (6) UNTS version T of T and we compute the up- p max per bound for F 1. The treebank we use for exper- Using these equations, we can prove that f iments is the WSJ10, which consists of the sen- correctly translates upper bounds of W to upper tences of the WSJ Penn Treebank whose length bounds of F 1 using calculus. In contrast to F 1, is at most 10 words after removing punctuation W not necessarily take values between 0 and 1. In- marks (Klein and Manning, 2004). We also re- stead, it takes values between K and −∞. More- moved lexical entries transforming POS tags into 1 over, it is negative when p < 2 , and goes to −∞ our terminal symbols as it is usually done (Klein when p goes to 0. Let C be an arbitrary UNTS and Manning, 2004; Bod, 2006a). grammar, and let pC , rC and wC be its precision, We start by finding the best independent set. To recall and W -weight respectively. Let w be our solve the problem in the practice, we convert it upper bound, so that wC ≤ w. If f1C is defined into an Integer Linear Programming (ILP) prob- as F 1(pC ,rC ) we need to show that f1C ≤ f(w). lem. ILP is also NP-hard (Karp, 1972), but there We bound f1C in two steps. First, we show that is software that implements efficient strategies for solving some of its instances (Achterberg, 2004). f1C ≤ f(wC ) ILP problems are defined by three parameters. First, there is a set of variables that can take val- and second, we show that ues from a finite set. Second, there is an objective function that has to be maximized, and third, there f(wC ) ≤ f(w). is a set of constraints that must be satisfied. In our The first inequality is proved by observing that case, we define a binary variable xs ∈ {0, 1} for f1C and f(wC ) are the values of the function every node s in the graph. Its value is 1 or 0, that respectively determines the presence or absence of 1 s in the set Cmax. The objective function is f1(r)= F 1 wC ,r 2 − Kr  xsw(s) at the points r = rC and r = 1 respectively. s∈XS(L) This function corresponds to the line defined by the F 1 values of all possible models that have a The constraints are defined using the edges of the

graph. For every edge (s1, s2) in the graph, we add the following constraint to the problem:

xs1 + xs2 ≤ 1

Gold constituents K: 35302
Strings |S(L)|: 68803
Nodes: 7029
Edges: 1204
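As an illustration of this encoding, the ILP can be written down with any off-the-shelf solver interface; the sketch below uses PuLP as a stand-in for the SCIP framework cited in the paper, with w a dictionary of node weights and edges a list of conflicting string pairs (both assumed inputs).

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

    def solve_mwis_ilp(w, edges):
        """One binary variable per node, objective sum(w[s]*x[s]), one constraint per edge."""
        prob = LpProblem("mwis", LpMaximize)
        x = {s: LpVariable("x_%d" % i, cat="Binary") for i, s in enumerate(w)}
        prob += lpSum(w[s] * x[s] for s in w)      # objective function
        for s1, s2 in edges:
            prob += x[s1] + x[s2] <= 1             # conflicting strings exclude each other
        prob.solve()
        return {s for s in w if value(x[s]) > 0.5}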

The 7422 trees of the WSJ10 treebank have a Table 1: Figures for the WSJ10 and its graph. total of 181476 substrings of length ≥ 2, that form the set S(L) of 68803 different substrings. Hits H 22169 The number of substrings in S(L) does not grow Misses M 2127 too much with respect to the number of strings in Weight W 20042 L because substrings are sequences of POS tags, Precision P 91.2% meaning that each substring is very frequent in the Recall R 62.8% corpus. If substrings were made out of words in- F1 F 1 74.4% stead of POS tags, the number of substrings would grow much faster, making the problem harder to Table 2: Summary of the scores for C . solve. Moreover, removing the strings s such that max w(s) ≤ 0 gives a total of only 7029 substrings. Since there is a node for each substring, the result- Table 3 shows results that allow us to com- ing graph contains 7029 nodes. Recall that there pare the upper bounds with state-of-the-art pars- is an edge between two strings if they occur over- ing scores. BestW corresponds to the scores of lapped. Our graph contains 1204 edges. The ILP Tmax and UBoundF1 is the result of our transla- version has 7029 variables, 1204 constraints and tion function f. From the table we can see that the objective function sums over 7029 variables. an unsupervised parser based on UNTS grammars These numbers are summarized in Table 1. may reach a sate-of-the-art performance over the The solution of the ILP problem is a set of WSJ10. RBranch is a WSJ10 version where all 6583 variables that are set to one. This set corre- trees are binary and right branching. DMV, CCM sponds to a set Cmax of nodes in our graph of the and DMV+CCM are the results reported in Klein same number of elements. Using Cmax we build and Manning (2004). U-DOP and UML-DOP a new version Tmax of the WSJ10, and compute are the results reported in Bod (2006b) and Bod its weight W , precision, recall and F 1. Their val- (2006a) respectively. Incremental refers to the re- ues are displayed in Table 2. Since the elements sults reported in Seginer (2007). of L were not introduced in S(L), elements of L We believe that our upper bound is a generous are not necessarily in Cmax, but in order to com- one and that it might be difficult to achieve it for pute precision and recall, we add them by hand. two reasons. First, since the WSJ10 corpus is Strictly speaking, the set of constituents that we a rather flat treebank, from the 68803 substrings use for building Tmax is Cmax plus the full span only 10% of them are such that c(s) > d(s). Our brackets. procedure has to decide among this 10% which We can, using equation (2), compute the up- of the strings are constituents. An unsupervised per bound of F 1 for all the possible scores of all method has to choose the set of constituents from UNTS grammars that use POS tags as alphabet: the set of all 68803 possible substrings. Second, we are supposing a recall of 100% which is clearly 1 too optimistic. We believe that we can find a f(wmax)= F 1 wmax , 1 = 82.2% 2 − K  tighter upper bound by finding an upper bound for recall, and by rewriting f in equation (2) in terms The precision for this upper bound is of the upper bound for recall. 1 It must be clear the scope of the upper bound P (w )= = 69.8% max wmax we found. First, note that it has been computed 2 − K over the WSJ10 treebank using the POS tags as while its recall is R = 100%. Note from the pre- the alphabet. 
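Plugging the reported counts into the formulas of the previous section reproduces Table 2 and the bound; this is plain arithmetic on the paper's numbers, not a re-run of the experiment.

    K, H, M = 35302, 22169, 2127              # Tables 1 and 2
    W = H - M                                  # 20042
    P = H / (H + M)                            # 0.912  (91.2%)
    R = H / K                                  # 0.628  (62.8%)
    F1 = 2 * H / (K + H + M)                   # 0.744  (74.4%)
    p_bound = 1 / (2 - W / K)                  # 0.698  precision at the bound
    f1_bound = 2 * p_bound / (p_bound + 1)     # 0.822  the 82.2% upper bound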
Any other alphabet we use, like for vious section that P (wmax) is not an upper bound example words, or pairs of words and POS tags, for precision but just the precision associated to changes the relation of compatibility among the the upper bound f(wmax). substrings, making a completely different universe

64 Model UP UR F1 alphabet. RBranch 55.1 70.0 61.7 From a more abstract perspective, we intro- DMV 46.6 59.2 52.1 duced a different approach to assess the usefulness CCM 64.2 81.6 71.9 of a grammatical formalism. Usually, formalism DMV+CCM 69.3 88.0 77.6 are proved to have interesting learnability proper- U-DOP 70.8 88.2 78.5 ties such as PAC-learnability or convergence of a UML-DOP 82.9 probabilistic distribution. We present an approach Incremental 75.6 76.2 75.9 that even though it does not provide an effective BestW(UNTS) 91.2 62.8 74.4 way of computing the best grammar in an unsu- UBoundF1(UNTS) 69.8 100.0 82.2 pervised fashion, it states the upper bound of per- formance for all the class of UNTS grammars.

Table 3: Performance on the WSJ10 of the most Acknowledgments recent unsupervised parsers, and our upper bounds on UNTS. This work was supported in part by grant PICT 2006-00969, ANPCyT, Argentina. We would like of UNTS grammars. Second, our computation of to thank Pablo Rey (UDP, Chile) for his help the upper bound was not made for supersets of the with ILP, and Demetrio Martín Vilela (UNC, Ar- WSJ10. Supersets such as the entire Penn Tree- gentina) for his detailed review. bank produce bigger graphs because they contain longer sentences and various different sequences References of substrings. As the maximization of W is an NP-hard problem, the computational cost of solv- Tobias Achterberg. 2004. SCIP - a framework to in- tegrate Constraint and Mixed Integer Programming. ing bigger instances grows exponentially. A third Technical report. limitation that must be clear is about the models affected by the bound. The upper bound, and in Rens Bod. 2006a. An all-subtrees approach to unsu- general the method, is only applicable to the class pervised parsing. In Proceedings of COLING-ACL 2006. of formal UNTS grammars, with only some very slight variants mentioned in the previous sections. Rens Bod. 2006b. Unsupervised parsing with U-DOP. Just moving to probabilistic or weighted UNTS In Proceedings of CoNLL-X. grammars invalidates all the presented results. Alexander Clark. 2006. PAC-learning unambiguous NTS languages. In Proceedings of ICGI-2006. 6 Conclusions Alexander Clark. 2007. Learning deterministic con- We present a method for assessing the potential of text free grammars: The Omphalos competition. UNTS grammars as a formalism for unsupervised Machine Learning, 66(1):93–110. parsing of natural language. We assess their po- Richard M. Karp. 1972. Reducibility among com- tential by finding an upper bound of their perfor- binatorial problems. In R. E. Miller and J. W. mance when they are evaluated using the WSJ10 Thatcher, editors, Complexity of Computer Compu- treebank. We show that any UNTS grammars can tations, pages 85–103. Plenum Press. achieve at most 82.2% of F 1 measure, a value Dan Klein and Christopher D. Manning. 2004. comparable to most state-of-the-art models. In or- Corpus-based induction of syntactic structure: Mod- der to compute this upper bound we introduced els of dependency and constituency. In Proceedings of ACL 42. a measure that does not define the same ordering among UNTS grammars as the F 1, but that has Mitchell P. Marcus, Beatrice Santorini, and Mary A. the advantage of being computationally easier to Marcinkiewicz. 1994. Building a large annotated optimize. Our measure can be used, by means of corpus of english: The Penn treebank. Computa- tional Linguistics, 19(2):313–330. a translation function, to find an upper bound for F 1. We also showed that the optimization proce- Yoav Seginer. 2007. Fast unsupervised incremental dure for our metric maps into an NP-Hard prob- parsing. In Proceedings of ACL 45. lem, but despite this fact we present experimen- tal results that compute the upper bound for the WSJ10 when POS tags are treated as the grammar

Comparing learners for Boolean partitions: implications for morphological paradigms∗

Katya Pertsova
University of North Carolina, Chapel Hill
[email protected]

Abstract the domain of cognitive science for which DNF’s have been used, e.g., concept learning. I also de- In this paper, I show that a problem of scribe two other learners proposed specifically for learning a morphological paradigm is sim- learning morphological paradigms. The first of ilar to a problem of learning a partition these learners, proposed by me, was designed to of the space of Boolean functions. I de- capture certain empirical facts about syncretism scribe several learners that solve this prob- and free variation in typological data (Pertsova, lem in different ways, and compare their 2007). The second learner, proposed by David basic properties. Adger, was designed as a possible explanation of another empirical fact - uneven frequencies of free 1 Introduction variants in paradigms (Adger, 2006). Lately, there has been a lot of work on acquir- In the last section, I compare the learners on ing paradigms as part of the word-segmentation some simple examples and comment on their mer- problem (Zeman, 2007; Goldsmith, 2001; Snover its and the key differences among the algorithms. et al., 2002). However, the problem of learning I also draw connections to other work, and discuss the distribution of affixes within paradigms as a directions for further empirical tests of these pro- function of their semantic (or syntactic) features is posals. much less explored to my knowledge. This prob- 2 The problem lem can be described as follows: suppose that the segmentation has already been established. Can Consider a problem of learning the distribution we now predict what affixes should appear in of inflectional morphemes as a function of some what contexts, where by a ‘context’ I mean some- set of features. Using featural representations, we thing quite general: some specification of seman- can represent morpheme distributions in terms of tic (and/or syntactic) features of the utterance. For a formula. The DNF formulas are commonly used example, one might say that the nominal suffix - for such algebraic representation. For instance, z in English (as in apple-z) occurs in contexts that given the nominal suffix -z mentioned in the in- involve plural or possesive nouns whose stems end troduction, we can assign to it the following rep- in a voiced segment. resentation: [(noun; +voiced]stem; +plural) ∨ In this paper, I show that the problem of learn- (noun; +voiced]stem; +possesive)]. Presum- ing the distribution of morphemes in contexts ably, features like [plural] or [+voiced] or ]stem specified over some finite number of features (end of the stem) are accessible to the learners’ is roughly equivalent to the problem of learn- cognitive system, and can be exploited during ing Boolean partitions of DNF formulas. Given the learning process for the purpose of “ground- this insight, one can easily extend standard DNF- ing” the distribution of morphemes.1 This way learners to morphological paradigm learners. I of looking at things is similar to how some re- show how this can be done on an example of searchers conceive of concept-learning or word- the classical k-DNF learner (Valiant, 1984). This 1Assuming an a priori given universal feature set, the insight also allows us to bridge the paradigm- problem of feature discovery is a subproblem of learning learning problem with other similar problems in morpheme distributions. This is because learning what fea- ture condition the distribution is the same as learning what This∗ paper ows a great deal to the input from Ed Stabler. 
As usual, all the errors and shortcomings are entirely mine.

features (from the universal set) are relevant and should be paid attention to.
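As a toy illustration of the representation used in the introduction (my own encoding, with made-up feature names), the suffix -z can be stored as a set of monomials and checked against a context:

    # Each monomial is a dict of feature values; the DNF is a list of monomials.
    Z_SUFFIX = [
        {"cat": "noun", "stem_final": "+voiced", "num": "+plural"},
        {"cat": "noun", "stem_final": "+voiced", "case": "+possessive"},
    ]

    def licensed(context, dnf):
        """A context satisfies a DNF if it matches every feature of some monomial."""
        return any(all(context.get(f) == v for f, v in mono.items()) for mono in dnf)

    licensed({"cat": "noun", "stem_final": "+voiced", "num": "+plural"}, Z_SUFFIX)  # True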


66 learning (Siskind, 1996; Feldman, 2000; Nosofsky Alternatively, we could say that the learner has et al., 1994). to learn a partition of Boolean functions associated However, one prominent distinction that sets with each morph (a Boolean function for a morph inflectional morphemes apart from words is that m maps the contexts in which m occurs to true, they occur in paradigms, semantic spaces defin- and all other contexts to false). ing a relatively small set of possible distinctions. However, when paradigms contain free varia- In the absence of free variation, one can say that tion, the divisions created by the morphs no longer the affixes define a partition of this semantic space define a partition since a single context may be as- into disjoint blocks, in which each block is asso- sociated with more than one morph. (Free vari- ciated with a unique form. Consider for instance ation is attested in world’s languages, although a present tense paradigm of the verb “to be” in it is rather marginal (Kroch, 1994).) In case a standard English represented below as a partition paradigm contains free variation, it is still possible of the set of environments over the following fea- to represent it as a partition by doing the follow- tures: class (with values masc, fem, both (masc ing: & fem),inanim,), number (with values +sg and −sg), and person (with values 1st, 2nd, 3rd).2 (1) Take a singleton partition of morph- meaning pairs (m, r) and merge any cells that have the same meaning r. Then merge am 1st. person; fem; +sg. those blocks that are associated with the 1st. person; masc; +sg. same set of morphs. are 2nd. person; fem; +sg. 2nd. person; masc; +sg. Below is an example of how we can use this trick 2nd. person; fem; −sg. to partition a paradigm with free-variation. The 2nd. person; masc; −sg. data comes from the past tense forms of “to be” in 2nd. person; both; −sg. Buckie English. 1st. person; fem; −sg. 1st. person; masc; −sg. was 1st. person; fem; +sg. 1st. person; both; −sg. 1st. person; masc; +sg. 3rd. person; masc; −sg 3rd person; masc; +sg 3rd. person; fem; −sg 3rd person; fem; +sg 3rd. person; both; −sg 3rd person; inanim; +sg 3rd. person; inanim; −sg was/were 2nd. person; fem; +sg. is 3rd person; masc; +sg 2nd. person; masc; +sg. 3rd person; fem; +sg 2nd. person; fem; −sg. 3rd person; inanim; +sg 2nd. person; masc; −sg. 2nd. person; both; −sg. Each block in the above partition can be rep- 1st. person; fem; −sg. resented as a mapping between the phonological 1st. person; masc; −sg. form of the morpheme (a morph) and a DNF for- 1st. person; both; −sg. mula. A single morph will be typically mapped to were 3rd. person; masc; −sg a DNF containing a single conjunction of features 3rd. person; fem; −sg (called a monomial). When a morph is mapped 3rd. person; both; −sg to a disjunction of monomials (as the morph [-z] 3rd. person; inanim; −sg discussed above), we think of such a morph as a homonym (having more than one “meaning”). In general, then, the problem of learning the Thus, one way of defining the learning problem is distribution of morphs within a single inflectional in terms of learning a partition of a set of DNF’s. paradigm is equivalent to learning a Boolean par- tition. 2These particular features and their values are chosen just for illustration. There might be a much better way to repre- In what follows, I consider and compare several sent the distinctions encoded by the pronouns. Also notice learners for learning Boolean partitions. 
Some of that the feature values are not fully independent: some com- binations are logically ruled out (e.g. speakers and listeners these learners are extensions of learners proposed are usually animate entities). in the literature for learning DNFs. Other learners

67 were explicitly proposed for learning morphologi- One way of stating the learning problem is to cal paradigms. say that the learner has to learn a grammar for the We should keep in mind that all these learners target language L (we would then have to spec- are idealizations and are not realistic if only be- ify what this grammar should look like). Another cause they are batch-learners. However, because way is to say that the learner has to learn the lan- they are relatively simple to state and to under- guage mapping itself. We can do the latter by us- stand, they allow a deeper understanding of what ing Boolean functions to represent the mapping of properties of the data drive generalization. each morph to a set of environments. Depending on how we state the learning problem, we might 2.1 Some definitions get different results. For instance, it’s known that Assume a finite set of morphs, Σ, and a finite set some subsets of DNF’s are not learnable, while of features F . It would be convenient to think of the Boolean functions corresponding to them are morphs as chunks of phonological material cor- learnable (Valiant, 1984). Since I will use Boolean responding to the pronounced morphemes.3 Ev- functions for some of the learners below, I intro- ery feature f ∈ F is associated with some set duce the following notation. Let B be the set of 0 of values Vf that includes a value [∗], unspec- Boolean functions mapping elements of S to true ified. Let S be the space of all possible com- or false. For convenience, we say that bm corre- plete assignments over F (an assignment is a set sponds to a Boolean function that maps a set of en- {fi → Vf |∀fi ∈ F }). We will call those assign- vironments to true when they are associated with ments that do not include any unspecified features m in L, and to false otherwise. environments. Let the set S0 ⊆ S correspond to the set of environments. 3 Learning Algorithms It should be easy to see that the set S forms a 3.1 Learner 1: an extension of the Valiant Boolean lattice with the following relation among k-DNF learner the assignments, ≤ : for any two assignments a R 1 An observation that a morphological paradigm can and a , a ≤ a iff the value of every feature f 2 1 R 2 i be represented as a partition of environments in in a is identical to the value of f in a , unless f 1 i 2 i which each block corresponds to a mapping be- is unspecified in a . The top element of the lattice 2 tween a morph and a DNF, allows us to easily con- is an assignment in which all features are unspec- vert standard DNF learning algorithms that rely ified, and the bottom is the contradiction. Every on positive and negative examples into paradigm- element of the lattice is a monomial corresponding learning algorithms that rely on positive examples to the conjunction of the specified feature values. only. We can do that by iteratively applying any An example lattice for two binary features is given DNF learning algorithm treating instances of in- in Figure 1. put pairs like (m, e) as positive examples for m and as negative examples for all other morphs. Figure 1: A lattice for 2 binary features Below, I show how this can be done by ex- tending a k-DNF4 learner of (Valiant, 1984) to a paradigm-learner. To handle cases of free varia- tion we need to keep track of what morphs occur in exactly the same environments. We can do this by defining the partition Π on the input following the recipe in (1) (substituting environments for the variable r). 
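A minimal encoding of the Section 2.1 definitions, assuming assignments are dictionaries with '*' for unspecified values (the representation and the feature names below are my own, chosen for illustration):

    from itertools import product

    FEATURES = {"f1": ["+", "-"], "f2": ["+", "-"]}      # two hypothetical binary features

    def assignments(features):
        """The set S: all assignments, with '*' meaning unspecified."""
        names = sorted(features)
        values = [features[n] + ["*"] for n in names]
        return [dict(zip(names, vs)) for vs in product(*values)]

    def environments(features):
        """The subset S0: assignments with no unspecified feature."""
        return [a for a in assignments(features) if "*" not in a.values()]

    def leq(a1, a2):
        """a1 <=_R a2: a1 matches a2 on every feature that a2 specifies."""
        return all(v == "*" or a1[f] == v for f, v in a2.items())

    def consistent(a1, a2):
        """Two assignments agree on all features that both of them specify."""
        return all(a1[f] == a2[f] for f in a1 if a1[f] != "*" and a2[f] != "*")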
The original learner learns from negative exam- ples alone. It initializes the hypothesis to the dis- A language L consists of pairs from Σ × S0. junction of all possible conjunctions of length at That is, the learner is exposed to morphs in differ- most k, and subtracts from this hypothesis mono- ent environments. mials that are consistent with the negative ex- amples. We will do the same thing for each 3However, we could also conceive of morphs as functions specifying what transformations apply to the stem without 4k-DNF formula is a formula with at most k feature val- much change to the formalism. ues in each conjunct.

68 W morph using positive examples only (as described values in ei, ei ∈ E (iii) let Dm be M, where a above), and forgoing subtraction in a cases of free- set s ∈ M iff s ∈ Mi, for some i. variation. The modified learner is given below. Learner 2 can now be stated as a learner that The following additional notation is used: Lex is is identical to Learner 1 except for the initial set- the lexicon or a hypothesis. The formula D is a ting of Lex. Now, Lex will be set to Lex := disjunction of all possible conjunctions of length {hm, Dmi| ∃hm, ei ∈ L(ti)}. at most k. We say that two assignments are con- Because this learner does not require enumer- sistent with each other if they agree on all specified ation of all possible monomials, but just those features. Following standard notation, we assume that are consistent with the positive data, it can that the learner is exposed to some text T that con- handle “polynomially explainable” subclass of sists of an infinite sequence of (possibly) repeating DNF’s (for more on this see (Kushilevitz and Roth, elements from L. tj is a finite subsequence of the 1996)). first j elements from T . L(tj) is the set of ele- ments in tj. 5 Learner 3: a learner biased towards monomial and elsewhere distributions Learner 1 (input: tj) Next, I present a batch version of a learner I pro- 1. set Lex := {hm, Di| ∃hm, ei ∈ posed based on certain typological observations L(tj)} and linguists’ insights about blocking. The typo- logical observations come from a sample of verbal hm, ei ∈ L(t ) 2. For each j , for each agreement paradigms (Pertsova, 2007) and per- m0 ¬∃ bl ∈ Π L(t ) s.t. block of j , sonal pronoun paradigms (Cysouw, 2003) show- hm, ei ∈ bl hm0, ei ∈ bl and : ing that majority of paradigms have either “mono- hm0, fi Lex hm0, f 0i replace in by mial” or “elsewhere” distribution (defined below). where f 0 is the result of removing Roughly speaking, a morph has a monomial dis- every monomial consistent with e. tribution if it can be described with a single mono- mial. A morph has an elsewhere distribution if This learner initially assumes that every morph this distribution can be viewed as a complement can be used everywhere. Then, when it hears one of distributions of other monomial or elsewhere- morph in a given environment, it assumes that no morphs. To define these terms more precisely I other morph can be heard in exactly that environ- need to introduce some additional notation. Let ment unless it already knows that this environment T e be the intersection of all environments in permits free variation (this is established in the x which morph x occurs (i.e., these are the invariant partition Π). features of x). This set corresponds to a least up- per bound of the environments associated with x in 4 Learner 2: the lattice hS, ≤Ri, call it lubx. Then, let the min- The next learner is an elaboration on the previous imal monomial function for a morph x, denoted learner. It differs from it in only one respect: in- mmx, be a Boolean function that maps an envi- stead of initializing lexical representations of ev- ronment to true if it is consistent with lubx and ery morph to be a disjunction of all possible mono- to false otherwise. As usual, an extension of a mials of length at most k, we initialize it to be the Boolean function, ext(b) is the set of all assign- disjunction of all and only those monomials that ments that b maps to true. are consistent with some environment paired with (2) Monomial distribution the morph in the language. 
This learner is simi- A morph x has a monomial distribution iff lar to the DNF learners that do something on both bx ≡ mmx. positive and negative examples (see (Kushilevitz and Roth, 1996; Blum, 1992)). The above definition states that a morph has a So, for every morph m used in the language, we monomial distribution if its invariant features pick define a disjunction of monomials Dm that can be out just those environments that are associated derived as follows. (i) Let Em be the enumeration with this morph in the language. More concretely, of all environments in which m occurs in L (ii) if a monomial morph always co-occurs with the let Mi correspond to a set of all subsets of feature feature +singular, it will appear in all singular en-

69 vironments in the language. The original learner I proposed is an incre- mental learner that calculates grammars similar (3) Elsewhere distribution to those proposed by linguists, namely grammars A morph x has an elsewhere distribution consisting of a lexicon and a filtering “blocking” b ≡ mm − (mm ∨ mm ∨ ... ∨ iff x x x1 x2 component. The version presented here is a sim- (mm )) x 6= x Σ xn for all i in . pler batch learner that learns a partition of Boolean 5 The definition above amounts to saying that a functions instead. Nevertheless, the main proper- morph has an elsewhere distribution if the envi- ties of the original learner are preserved: specifi- ronments in which it occurs are in the extension cally, a bias towards monomial and elsewhere dis- of its minimal monomial function minus the min- tributions. imal monomial functions of all other morphs. An To determine what kind of distribution a morph example of a lexical item with an elsewhere distri- has, I define a relation C. A morph m stands in a 0 bution is the present tense form are of the verb “to relation C to another morph m if ∃hm, ei ∈ L, be”, shown below. such that lubm0 is consistent with e. In other words, mCm0 if m occurs in any environment consistent with the invariant features of m0. Let Table 1: The present tense of “to be” in English C+ be a transitive closure of C. sg. pl 1p. am are Learner 3 (input: tj) 2p. are are 1. Let S(tj) be the set of pairs in tj containing 3p. is are monomial- or elsewhere-distribution morphs. 0 That is, hm, ei ∈ S(tj) iff ¬∃m such that Elsewhere morphemes are often described in mC+m0 and m0C+m. linguistic accounts by appealing to the notion of 2. Let O(t ) = t − S(t ) (the set of all other blocking. For instance, the lexical representation j j j pairs). of are is said to be unspecified for both person and number, and is said to be “blocked” by two 3. A pair hm, ei ∈ S is a least element of S other forms: am and is. My hypothesis is that iff ¬∃hm0, e0i ∈ (S − {hm, ei}) such that the reason why such non-monotonic analyses ap- m0C+m. pear so natural to linguists is the same reason for why monomial and elsewhere distributions are ty- 4. Given a hypothesis Lex, and for any expres- pologically common: namely, the learners (and, sion hm, ei ∈ Lex: let rem((m, e), Lex) = 0 6 apparently, the analysts) are prone to generalize (m, (mmm − {b|hm , bi ∈ Lex})) the distribution of morphs to minimal monomi- als first, and later correct any overgeneralizations 1. set S := S(tj) and Lex := ∅ that might arise by using default reasoning, i.e. by 2. While S 6= ∅: remove a least x positing exceptions that override the general rule. from S and set Lex := Lex ∪ Of course, the above strategy alone is not sufficient rem(x, Lex) to capture distributions that are neither monomial, nor elsewhere (I call such distributions “overlap- 3. Set Lex := Lex ∪ O(tj). ping”, cf. the suffixes -en and -t in the German paradigm in Table 2), which might also explain This learner initially assumes that the lexicon is why such paradigms are typologically rare. empty. Then it proceeds adding Boolean functions corresponding to minimal monomials for morphs Table 2: Present tense of some regular verbs in that are in the set S(tj) (i.e., morphs that have ei- German ther monomial or elsewhere distributions). This sg. pl 5I thank Ed Stabler for relating this batch learner to me 1p. -e -en (p.c.). 6For any two Boolean functions b, b0: b−b0 is the function 2p. -st -t that maps e to 1 iff e ∈ ext(b) and e 6∈ ext(b0). Similarly, 3p. 
-t -en

b + b′ is the function that maps e to 1 iff e ∈ ext(b) and e ∈ ext(b′).
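The invariant-feature intersection (lub_x) and the relation C that Learner 3 relies on can be sketched along the same lines (again my own formats: environments are dictionaries and data is a list of (morph, environment) pairs):

    def lub(morph, data):
        """Intersection of all environments the morph occurs in: its invariant features."""
        envs = [e for m, e in data if m == morph]
        inv = dict(envs[0])
        for e in envs[1:]:
            for f, v in list(inv.items()):
                if e[f] != v:
                    inv[f] = "*"                 # disagreements become unspecified
        return inv

    def consistent(env, mono):
        return all(v == "*" or env[f] == v for f, v in mono.items())

    def related_C(m1, m2, data):
        """m1 C m2: m1 occurs in some environment consistent with lub(m2)."""
        target = lub(m2, data)
        return any(m == m1 and consistent(e, target) for m, e in data)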

70 is done in a particular order, namely in the or- This learner considers all monomials in the or- der in which the morphs can be said to block der of their simplicity (determined by the num- each other. The remaining text is learned by rote- ber of specified features), and if the monomial in memorization. Although this learner is more com- question is consistent with environments associ- plex than the previous two learners, it generalizes ated with a unique morph then these environments fast when applied to paradigms with monomial are added to the extension of the Boolean function and elsewhere distributions. for that morph. As a result, this learner will con- verge faster on paradigms in which morphs can be 5.1 Learner 4: a learner biased towards described with disjunctions of shorter monomials shorter formulas since such monomials are considered first. Next, I discuss a learner for morphological paradigms, proposed by another linguist, David 6 Comparison Adger. Adger describes his learner informally showing how it would work on a few examples. 6.1 Basic properties Below, I formalize his proposal in terms of learn- ing Boolean partitions. The general strategy of First, consider some of the basic properties of the this learner is to consider simplest monomials first learners presented here. For this purpose, we will (those with the fewer number of specified features) assume that we can apply these learners in an iter- and see how much data they can unambiguously ative fashion to larger and larger batches of data. and non-redundantly account for. If a monomial We say that a learner is consistent if and only if, is consistent with several morphs in the text - it is given a text tj, it always converges on the gram- discarded unless the morphs in question are in free mar generating all the data seen in tj (Osherson variation. This simple strategy is reiterated for the et al., 1986). A learner is monotonic if and only next set of most simple monomials, etc. if for every text t and every point j < k, the hy- Learner 4 (input tj) pothesis the learner converges on at tj is a subset of the hypothesis at t (or for learners that learn 1. Let M be the set of all monomials over F k i by elimination: the hypothesis at t is a superset with i specified features. j of the hypothesis at tk). And, finally, a learner is 2. Let Bi be the set of Boolean functions from generalizing if and only if for some tj it converges environments to truth values corresponding on a hypothesis that makes a prediction beyond the to Mi in the following way: for each mono- elements of tj. mial mn ∈ Mi the corresponding Boolean The table below classifies the four learners ac- function b is such that b(e) = 1 if e is an cording to the above properties. environment consistent with mn; otherwise b(e) = 0. Learner consist. monoton. generalizing 3. Uniqueness check: Learner 1 yes yes yes For a Boolean function b, morph m, and text Learner 2 yes yes yes Learner 3 yes no yes tj let unique(b, m, tj) = 1 iff ext(bm) ⊆ 0 Learner 4 yes yes yes ext(b) and ¬∃hm , ei ∈ L(tj), s.t. e ∈ ext(b) and e 6∈ ext(bm). All learners considered here are generalizing 1. set Lex := Σ × ∅ and i := 0; and consistent, but they differ with respect to 2. while Lex does not correspond to monotonicity. Learner 3 is non-monotonic while L(tj) AND i ≤ |F | do: the remaining learners are monotonic. While for each b ∈ Bi, for each m, s.t. 
∃⟨m, e⟩ ∈ L(tj):

• if unique(b, m, tj) = 1 then replace ⟨m, f⟩ with ⟨m, f + b⟩ in Lex

i ← i + 1

While monotonicity is a nice computational property, some aspects of human language acquisition are suggestive of a non-monotonic learning strategy, e.g. the presence of overgeneralization errors and their subsequent corrections by children (Marcus et al., 1992). Thus, the fact that Learner 3 is non-monotonic might speak in its favor.
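Returning to Learner 4: read this way, its pass over progressively more specific monomials can be sketched as below. This is a simplified reconstruction of the description above (it ignores free variation and the stopping condition on Lex) rather than Adger's own procedure.

    from itertools import combinations, product

    def monomials(features, i):
        """All monomials over F that specify exactly i features ('*' = unspecified)."""
        out = []
        for chosen in combinations(sorted(features), i):
            for vals in product(*[features[f] for f in chosen]):
                m = {f: "*" for f in features}
                m.update(dict(zip(chosen, vals)))
                out.append(m)
        return out

    def covers(mono, env):
        return all(v == "*" or env[f] == v for f, v in mono.items())

    def learner4(data, features):
        """data: list of (morph, environment) pairs; returns morph -> accepted monomials."""
        lex = {m: [] for m, _ in data}
        for i in range(len(features) + 1):             # simplest monomials first
            for mono in monomials(features, i):
                hit_morphs = {m for m, e in data if covers(mono, e)}
                if len(hit_morphs) == 1:               # unambiguous: only one morph covered
                    lex[hit_morphs.pop()].append(mono)
        return lex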

71 6.2 Illustration tion, however, on top of that it also restricts its ini- tial hypothesis which leads to faster convergence. To demonstrate how the learners work, consider this simple example. Suppose we are learning the Let’s now consider the behavior of learner 3 on following distribution of morphs A and B over 2 example 1. Recall that this learner first computes binary features. minimal monomials of all morphs, and checks in they have monomial or elsewhere distributions (4) Example 1 (this is done via the relation C+). In this case, A has a monomial distribution, and B has an else- +f1 −f1 where distribution. Therefore, the learner first +f2 AB computes the Boolean function for A whose exten- −f2 BB sion is simply (+f1; +f2); and then the Boolean function for B, whose extension includes environ- ments consistent with (*;*) minus those consistent Suppose further that the text t3 is: with (+f1; +f2), which yields the following hy- A +f1; +f2 pothesis: B −f1; +f2 B +f1; −f2 ext(bA) [+f1; +f2] ext(bB)[−f1; +f2][+f1; −f2][−f1; −f2] Learner 1 generalizes right away by assuming that every morph can appear in every environment That is, Learner 3 generalizes and converges on which leads to massive overgeneralizations. These the right language after seeing text t3. overgeneralizations are eventually eliminated as Learner 4 also converges at this point. This more data is discovered. For instance, after pro- learner first considers how much data can be un- cessing the first pair in the text above, the learner ambiguously accounted for with the most minimal “learns” that B does not occur in any environ- monomial (*;*). Since both A and B occur in en- ment consistent with (+f1; +f2) since it has just vironments consistent with this monomial, noth- seen A in that environment. After processing t3, ing is added to the lexicon. On the next round, Learner 1 has the following hypothesis: it considers all monomials with one specified fea- A (+f1; +f2) ∨ (−f1; −f2) ture. 2 such monomials, (−f1) and (−f2), are B (−f1) ∨ (−f2) consistent only with B, and so we predict B to ap- pear in the not-yet-seen environment (−f1; −f2). That is, after seeing t3, Learner 2 correctly pre- Thus, the hypothesis that Learner 4 arrives at is the dicts the distribution of morphs in environments same as the hypothesis Learners 3 arrives at after that it has seen, but it still predicts that both A seeing t3. and B should occur in the not-yet-observed en- vironment, (−f1; −f2). This learner can some- 6.3 Differences times converge before seeing all data-points, es- pecially if the input includes a lot of free varia- While the last three learners perform similarly on tion. If fact, if in the above example A and B were the simple example above, there are significant in free variation in all environments, Learner 1 differences between them. These differences be- would have converged right away on its initial set- come apparent when we consider larger paradigms ting of the lexicon. However, in paradigms with no with homonymy and free variation. free variation convergence is typically slow since First, let’s look at an example that involves a the learner follows a very conservative strategy of more elaborate homonymy than example 1. Con- learning by elimination. sider, for instance, the following text. Unlike Learner 1, Learner 2 will converge after (5) Example 2 seeing t3. This is because this learner’s initial hy- A [+f1; +f2; +f3] pothesis is more restricted. 
Namely, the initial hy- A [+f1; −f2; −f3] pothesis for A includes disjunction of only those A [+f1; +f2; −f3] monomials that are consistent with (+f1; +f2). A [−f1; +f2; +f3] Hence, A is never overgeneralized to (−f1; −f2). B [−f1; −f2; −f3] Like Learner 1, Learner 2 also learns by elimina-

Given this text, all three learners will differ in their predictions with respect to the environment (−f1; +f2; −f3). Learner 2 will predict both A and B to occur in this environment, since not enough monomials will be removed from the representations of A or B to rule out either morph from occurring in (−f1; +f2; −f3). Learner 3 will predict A to appear in all environments that haven't been seen yet, including (−f1; +f2; −f3). This is because in the current text the minimal monomial for A is (∗; ∗; ∗) and A has an elsewhere distribution. On the other hand, Learner 4 predicts B to occur in (−f1; +f2; −f3). This is because the extension of the Boolean function for B includes any environments consistent with (−f1; −f3) or (−f1; −f2), since these are the simplest monomials that uniquely pick out B.

Thus, the three learners follow very different generalization routes. Overall, Learner 2 is more cautious and slower to generalize: it predicts free variation in all environments for which not enough data has been seen to converge on a single morph. Learner 3 is unique in preferring monomial and elsewhere distributions; for instance, in the above example it treats A as a 'default' morph. Learner 4 is unique in its preference for morphs describable with a disjunction of simpler monomials. Because of this preference, it will sometimes generalize even after seeing just one instance of a morph (since several simple monomials can be consistent with this instance alone).

One way to test what human learners do in a situation like the one above is to use artificial grammar learning experiments. Such experiments have been used for learning individual concepts over features like shape, color, texture, etc. Some work on concept learning suggests that it is subjectively easier to learn concepts describable with shorter formulas (Feldman, 2000; Feldman, 2004). Other recent work challenges this idea (Lafond et al., 2007), showing that people don't always converge on the most minimal representation, but instead go for a simpler and more general representation and learn exceptions to it (this approach is more in line with Learner 3).

Some initial results from my pilot experiments on learning partitions of concept spaces (using abstract shapes rather than language stimuli) also suggest that people find paradigms with elsewhere distributions easier to learn than the ones with overlapping distributions (like the German paradigms in 2). However, I also found a bias toward paradigms with a smaller number of relevant features. This bias is consistent with Learner 4, since this learner tries to assume the smallest number of relevant features possible. Thus, both learners have their merits.

Another area in which the considered learners make somewhat different predictions has to do with free variation. While I can't discuss this at length due to space constraints, let me comment that any batch learner can easily detect free variation before generalizing, which is exactly what most of the above learners do (except Learner 3, but it can also be changed to do the same thing). However, since free variation is rather marginal in morphological paradigms, it is possible that it would be rather problematic. In fact, free variation is more problematic if we switch from batch learners to incremental learners.

7 Directions for further research

There are of course many other learners one could consider for learning paradigms, including approaches quite different in spirit from the ones considered here. In particular, some recently popular approaches conceive of learning as matching probabilities of the observed data (e.g., Bayesian learning). Comparing such approaches with the algorithmic ones is difficult, since the criteria for success are defined so differently, but it would still be interesting to see whether the kinds of prior assumptions needed for a Bayesian model to match human performance would have something in common with the properties that the learners considered here relied on. These properties include the disjoint nature of paradigm cells, the prevalence of monomial and elsewhere morphs, and the economy considerations. Other empirical work that might help to differentiate Boolean partition learners (besides the typological and experimental work already mentioned) includes finding relevant language acquisition data and examining (or modeling) language change (assuming that learning biases influence language change).
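
As a further illustrative aside (again mine, not the author's implementation, and only one possible reading of the description of Learner 4 above), the sketch below reconstructs the simplest monomials that uniquely pick out B in the running example and checks the prediction for the unseen environment (−f1; +f2; −f3); it reuses the hypothetical encoding from the earlier sketch, and the names simplest_unique and UNSEEN are likewise hypothetical.

from itertools import combinations, product

FEATURES = ["f1", "f2", "f3"]
TEXT = [
    ("A", {"f1": "+", "f2": "-", "f3": "-"}),
    ("A", {"f1": "+", "f2": "+", "f3": "-"}),
    ("A", {"f1": "-", "f2": "+", "f3": "+"}),
    ("B", {"f1": "-", "f2": "-", "f3": "-"}),
]
UNSEEN = {"f1": "-", "f2": "+", "f3": "-"}  # the held-out environment

def consistent(mono, env):
    # A monomial is consistent with an environment if every feature
    # value it specifies matches the environment.
    return all(env.get(f) == v for f, v in mono.items())

def monomials():
    # All monomials over FEATURES, from (*, *, *) to fully specified.
    for r in range(len(FEATURES) + 1):
        for feats in combinations(FEATURES, r):
            for vals in product("+-", repeat=r):
                yield dict(zip(feats, vals))

def simplest_unique(morph):
    # Smallest monomials consistent with all environments of `morph`
    # and with no environment of any other morph.
    own = [e for m, e in TEXT if m == morph]
    other = [e for m, e in TEXT if m != morph]
    unique = [mono for mono in monomials()
              if all(consistent(mono, e) for e in own)
              and not any(consistent(mono, e) for e in other)]
    smallest = min((len(m) for m in unique), default=None)
    return [m for m in unique if len(m) == smallest]

# The simplest monomials uniquely picking out B: (-f1; -f2) and (-f1; -f3).
print(simplest_unique("B"))
# On this reading, Learner 4 predicts B in the unseen environment,
# because (-f1; -f3) is consistent with (-f1; +f2; -f3).
print(any(consistent(m, UNSEEN) for m in simplest_unique("B")))  # True

On this encoding, the simplest uniquely identifying monomials for B come out as (−f1; −f2) and (−f1; −f3), and the unseen environment satisfies the latter, which matches the divergence described above: Learner 4 predicts B there, while Learner 3, treating A as the elsewhere morph, predicts A.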

References

David Adger. 2006. Combinatorial variation. Journal of Linguistics, 42:503–530.

Avrim Blum. 1992. Learning Boolean functions in an infinite attribute space. Machine Learning, 9:373–386.

Michael Cysouw. 2003. The Paradigmatic Structure of Person Marking. Oxford University Press, NY.

Jacob Feldman. 2000. Minimization of complexity in human concept learning. Nature, 407:630–633.

Jacob Feldman. 2004. How surprising is a simple pattern? Quantifying ‘Eureka!’. Cognition, 93:199–224.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27:153–198.

Anthony Kroch. 1994. Morphosyntactic variation. In Katharine Beals et al., editors, Papers from the 30th regional meeting of the Chicago Linguistic Society: Parasession on variation and linguistic theory. Chicago Linguistic Society, Chicago.

Eyal Kushilevitz and Dan Roth. 1996. On learning visual concepts and DNF formulae. Machine Learning, 24:65–85.

Daniel Lafond, Yves Lacouture, and Guy Mineau. 2007. Complexity minimization in rule-based category learning: revising the catalog of Boolean concepts and evidence for non-minimal rules. Journal of Mathematical Psychology, 51:57–74.

Gary Marcus, Steven Pinker, Michael Ullman, Michelle Hollander, T. John Rosen, and Fei Xu. 1992. Overregularization in language acquisition. Monographs of the Society for Research in Child Development, 57(4). Includes commentary by Harold Clahsen.

Robert M. Nosofsky, Thomas J. Palmeri, and S.C. McKinley. 1994. Rule-plus-exception model of classification learning. Psychological Review, 101:53–79.

Daniel Osherson, Scott Weinstein, and Michael Stob. 1986. Systems that Learn. MIT Press, Cambridge, Massachusetts.

Katya Pertsova. 2007. Learning Form-Meaning Mappings in the Presence of Homonymy. Ph.D. thesis, University of California, Los Angeles.

Jeffrey Mark Siskind. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):1–38, Oct-Nov.

Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent. 2002. Unsupervised learning of morphology using a novel directed search algorithm: taking the first step. In Proceedings of the ACL-02 workshop on Morphological and phonological learning, pages 11–20, Morristown, NJ, USA. Association for Computational Linguistics.

Leslie G. Valiant. 1984. A theory of the learnable. CACM, 27(11):1134–1142.

Daniel Zeman. 2007. Unsupervised acquiring of morphological paradigms from tokenized text. In Working Notes for the Cross Language Evaluation Forum Workshop, Budapest, Hungary.

Author Index

Angluin, Dana, 16

Becerra-Bonache, Leonor, 16

Candito, Marie, 49
Casacuberta, Francisco, 24
Ćavar, Damir, 5
Clark, Alexander, 33
Crabbé, Benoit, 49
de la Higuera, Colin, 1

Eyraud, Remi, 33

Geertzen, Jeroen, 7
González, Jorge, 24

Habrard, Amaury, 33

Infante-Lopez, Gabriel, 58

Luque, Franco M., 58

Pertsova, Katya, 66

Seddah, Djamé, 49
Stehouwer, Herman, 41
van Zaanen, Menno, 1, 41
