Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference
EACL 2009

CLAGI 2009
30 March 2009
Megaron Athens International Conference Centre
Athens, Greece

Production and Manufacturing by
TEHNOGRAFIA DIGITAL PRESS
7 Ektoros Street
152 35 Vrilissia
Athens, Greece

© 2009 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

Preface

We are delighted to present you with this volume containing the papers accepted for presentation at the 1st workshop CLAGI 2009, held in Athens, Greece, on March 30th 2009.

We want to acknowledge the help of the PASCAL 2 network of excellence. Thanks also to Damir Ćavar for giving an invited talk and to the programme committee for the reviewing and advising.

We are indebted to the general chair of EACL 2009, Alex Lascarides, to the publication chairs, Kemal Oflazer and David Schlangen, and to the Workshop chairs Stephen Clark and Miriam Butt, for all their help. We are also grateful for Thierry Murgue’s assistance when putting together the proceedings.

Wishing you a very enjoyable time at CLAGI 2009!
Menno van Zaanen and Colin de la Higuera
CLAGI 2009 Programme Chairs

CLAGI 2009 Program Committee

Program Chairs:

Menno van Zaanen, Tilburg University (The Netherlands)
Colin de la Higuera, University of Saint-Étienne (France)

Program Committee Members:

Pieter Adriaans, University of Amsterdam (The Netherlands)
Srinivas Bangalore, AT&T Labs-Research (USA)
Leonor Becerra-Bonache, Yale University (USA)
Rens Bod, University of Amsterdam (The Netherlands)
Antal van den Bosch, Tilburg University (The Netherlands)
Alexander Clark, Royal Holloway, University of London (UK)
Walter Daelemans, University of Antwerp (Belgium)
Shimon Edelman, Cornell University (USA)
Jeroen Geertzen, University of Cambridge (UK)
Jeffrey Heinz, University of Delaware (USA)
Colin de la Higuera, University of Saint-Étienne (France)
Alfons Juan, Polytechnic University of Valencia (Spain)
Frantisek Mraz, Charles University (Czech Republic)
Georgios Petasis, National Centre for Scientific Research (NCSR) “Demokritos” (Greece)
Khalil Sima’an, University of Amsterdam (The Netherlands)
Richard Sproat, University of Illinois at Urbana-Champaign (USA)
Menno van Zaanen, Tilburg University (The Netherlands)
Willem Zuidema, University of Amsterdam (The Netherlands)

Table of Contents

Grammatical Inference and Computational Linguistics
    Menno van Zaanen and Colin de la Higuera

On Bootstrapping of Linguistic Features for Bootstrapping Grammars
    Damir Ćavar

Dialogue Act Prediction Using Stochastic Context-Free Grammar Induction
    Jeroen Geertzen

Experiments Using OSTIA for a Language Production Task
    Dana Angluin and Leonor Becerra-Bonache

GREAT: A Finite-State Machine Translation Toolkit Implementing a Grammatical Inference Approach for Transducer Inference (GIATI)
    Jorge González and Francisco Casacuberta

A Note on Contextual Binary Feature Grammars
    Alexander Clark, Remi Eyraud and Amaury Habrard
Language Models for Contextual Error Detection and Correction
    Herman Stehouwer and Menno van Zaanen

On Statistical Parsing of French with Supervised and Semi-Supervised Strategies
    Marie Candito, Benoit Crabbé and Djamé Seddah

Upper Bounds for Unsupervised Parsing with Unambiguous Non-Terminally Separated Grammars
    Franco M. Luque and Gabriel Infante-Lopez

Comparing Learners for Boolean partitions: Implications for Morphological Paradigms
    Katya Pertsova

Grammatical Inference and Computational Linguistics

Menno van Zaanen
Tilburg Centre for Creative Computing
Tilburg University
Tilburg, The Netherlands
[email protected]

Colin de la Higuera
University of Saint-Étienne
France
[email protected]

Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, pages 1–4, Athens, Greece, 30 March 2009. © 2009 Association for Computational Linguistics

1 Grammatical inference and its links to natural language processing

When dealing with language, (machine) learning can take many different forms, of which the most important are those concerned with learning languages and grammars from data. Questions in this context have been at the intersection of the fields of inductive inference and computational linguistics for the past fifty years. To go back to the pioneering work, Chomsky (1955; 1957) and Solomonoff (1960; 1964) were interested, for very different reasons, in systems or programs that could deduce a language when presented with information about it.

Gold (1967; 1978) proposed a little later a unifying paradigm called identification in the limit, and the term grammatical inference seems to have appeared in Horning’s PhD thesis (1969).

Outside linguistics, researchers and engineers dealing with pattern recognition, under the impetus of Fu (1974; 1975), invented algorithms and studied subclasses of languages and grammars from the point of view of what could or could not be learned.

Researchers in machine learning tackled related problems (the most famous being that of inferring a deterministic finite automaton, given examples and counter-examples of strings). Angluin (1978; 1980; 1981; 1982; 1987) introduced the important setting of active learning, or learning from queries, whereas Pitt and his colleagues (1988; 1989; 1993) gave several complexity-inspired results that exposed the hardness of the different learning problems.

Researchers working in more applied areas, such as computational biology, also deal with strings. A number of researchers from that field worked on learning grammars or automata from string data (Brazma and Cerans, 1994; Brazma, 1997; Brazma et al., 1998). Similarly, stemming from computational linguistics, one can point out the work relating language learning with more complex grammatical formalisms (Kanazawa, 1998), the more statistical approaches based on building language models (Goodman, 2001), or the different systems introduced to automatically build grammars from sentences (van Zaanen, 2000; Adriaans and Vervoort, 2002). Surveys of related work in specific fields can also be found (Natarajan, 1991; Kearns and Vazirani, 1994; Sakakibara, 1997; Adriaans and van Zaanen, 2004; de la Higuera, 2005; Wolf, 2006).

2 Meeting points between grammatical inference and natural language processing

Grammatical inference scientists belong to a number of larger communities: machine learning (with special emphasis on inductive inference), computational linguistics, and pattern recognition (within the structural and syntactic sub-group). There is a specific conference called ICGI (International Colloquium on Grammatical Inference) devoted to the subject. These conferences have been held at Alicante (Carrasco and Oncina, 1994), Montpellier (Miclet and de la Higuera, 1996), Ames (Honavar and Slutski, 1998), Lisbon (de Oliveira, 2000), Amsterdam (Adriaans et al., 2002), Athens (Paliouras and Sakakibara, 2004), Tokyo (Sakakibara et al., 2006) and Saint-Malo (Clark et al., 2008). The proceedings of these events contain a number of technical papers. Within this context, there has been a growing trend towards problems of language learning in the field of computational linguistics.

The formal objects in common between the two communities are the different types of automata and grammars. Therefore, another meeting point between these communities has been the different workshops, conferences and journals that focus on grammars and automata, for instance, FSMNLP, GRAMMARS, CIAA, ...

3 Goal for the workshop

There has been growing interest over the last few years in learning grammars from natural language text (and structured or semi-structured text). The family of techniques enabling such learning is usually called “grammatical inference” or “grammar induction”.

The field of grammatical inference is often subdivided into formal grammatical inference, where researchers aim to prove efficient learnability of classes of grammars, and empirical grammatical inference, where the aim is to learn structure from data. In this case the existence of an underlying grammar is just regarded as a hypothesis and what is sought is to better describe the language through some automatically learned rules.

Both formal and empirical grammatical inference have been linked with (computational) linguistics. Formal learnability of grammars has been used in discussions on how people learn language. Some people mention proofs of (non-)learnability of certain classes of grammars as arguments in the empiricist/nativist discussion. On the more practical side, empirical systems that learn grammars

• Unsupervised parsing
• Language modelling
• Transducers, for instance, for
    – morphology,
    – text to speech,
    – automatic translation,
    – transliteration,
    – spelling correction, ...
• Learning syntax with semantics,
• Unsupervised or semi-supervised learning of linguistic knowledge,
• Learning (classes of) grammars (e.g. subclasses of the Chomsky Hierarchy) from linguistic inputs,
• Comparing learning results in different frameworks (e.g. membership vs. correction queries),
• Learning linguistic structures (e.g. phonological features, lexicon) from the acoustic signal,
• Grammars and finite state machines in ma-