Computational Linguistics in the Netherlands 2004

Published by LOT
Trans 10
3512 JK Utrecht
The Netherlands

phone: +31 30 253 6006
fax: +31 30 253 6000
e-mail: [email protected]
http://wwwlot.let.uu.nl/

ISBN 90-76864-91-8
NUR 632

© 2005 by the individual authors

Computational Linguistics in the Netherlands 2004
Selected papers from the fifteenth CLIN meeting

Edited by
Ton van der Wouden
Michaela Poß
Hilke Reckman
Crit Cremers

LOT
Utrecht 2005

Contents

Preface

Invited Speaker

Shalom Lappin
Machine Learning and the Cognitive Basis of Natural Language

Regular Papers

Tamás Bíró
When the Hothead Speaks. Simulated Annealing Optimality Theory for Dutch Fast Speech

Wauter Bosma
Query-Based Summarization using Rhetorical Structure Theory

Anne Hendrik Buist, Wessel Kraaij, Stephan Raaijmakers
Automatic Summarization of Meeting Data: A Feasibility Study

Cristiano Chesi
Phases and Complexity in Phrase Structure Building

Tim Van de Cruys
Between VP Adjuncts and Second Pole in Dutch. A corpus-based survey regarding the complements between VP adjuncts and second pole in Dutch

Frank Van Eynde
Argument Realization in Dutch

Nicole Grégoire
Accentuation of Adpositions and Particles in a Text-to-Speech System for Dutch

Lars Hellan, Dorothee Beermann, Jon Atle Gulla, Atle Prange
Trailfinder - A Case Study in Extracting Spatial Information Using Deep Language Processing

Véronique Hoste, Walter Daelemans
Learning Dutch Coreference Resolution

Kim Luyckx, Walter Daelemans
Shallow Text Analysis and Machine Learning for Authorship Attribution

Jori Mur
Off-line Answer Extraction for Dutch QA

Lonneke van der Plas, Gosse Bouma
Syntactic Contexts for Finding Semantically Related Words

Michaela Poß, Ton van der Wouden
Extended Lexical Units in Dutch

Wojciech Skut
Preference-Driven Bimachine Compilation. An Application to TTS Text Normalisation

Alexandros Tantos
An Interface between Lexical and Discourse Semantics. The case of the light verb "have"

List of Contributors

Preface

It is our pleasure to present this selection of the papers presented at the fifteenth installment of Computational Linguistics in the Netherlands, held at Leiden University on Friday, December 17th, 2004. Organized by the computational linguists of what was at that time called the Leiden Centre for Linguistics (ULCL), this one-day conference attracted 74 abstracts from no fewer than four continents, and over one hundred participants.

Shalom Lappin, of King's College London, and Luc Steels, of the Free University, Brussels, and the Sony Computer Science Laboratories, Paris, delivered the keynote lectures, on "From Universal Grammar to Machine Learning: the Significance of Natural Language Engineering for Theories of Cognition" and "Emergent Fluid Construction Grammars", respectively. A written version of Prof. Lappin's paper is included in the present volume. The separate call for papers resulted in the submission of 26 papers, of which only 16 could be included.
We wish to express our heartfelt thanks to the reviewing panel, consisting of Harald Baaijen, Stefano Bocconi, Antal van den Bosch, Harry Bunt, Alessandra Corda, Arthur Dirksen, Francisca de Jong, Truus Kruijt, Frank Landsbergen, Anton Nijholt, Gertjan van Noord, Jan Odijk, Manfred Sailer, Rob van der Sandt, Mariët Theune, Mark de Vos, Theo Vosse, Henk Zeevat, and Joost Zwarts, for helping us in the difficult task of selecting the papers most suitable for this volume, as well as to the authors for complying with our harsh demands with respect to formatting issues and time schedules. As you will see from the table of contents, the papers cover a wide variety of topics from the field of computational linguistics. In our view, the alphabet was the least arbitrary ordering principle.

Finally, we wish to thank everyone who helped with the local organization, the former ULCL for cooperation and financial support, LOT for their willingness to publish this volume in their Occasional Series, and the Faculty of Arts of Leiden University for providing a venue.

Leiden, October 2005
Crit Cremers
Michaela Poß
Hilke Reckman
Ton van der Wouden

Machine Learning and the Cognitive Basis of Natural Language

Shalom Lappin
Department of Philosophy, King's College London

Abstract

Machine learning and statistical methods have yielded impressive results in a wide variety of natural language processing tasks. These advances have generally been regarded as engineering achievements. In fact it is possible to argue that the success of machine learning methods is significant for our understanding of the cognitive basis of language acquisition and processing. Recent work in unsupervised grammar induction is particularly relevant to this issue. It suggests that knowledge of language can be achieved through general learning procedures, and that a richly articulated language faculty is not required to explain its acquisition.

1 Introduction

The past fifteen years have seen a massive expansion in the application of information theoretic and machine learning (ML) methods to natural language processing. This work has yielded impressive results in accuracy and coverage for engineering systems addressing a wide variety of tasks in areas like speech recognition, morphological analysis, parsing, semantic interpretation, and dialogue management. It is worth considering whether the inductive learning mechanisms that these methods employ have consequences not simply for natural language engineering, but also for our understanding of the cognitive basis of human language acquisition and processing.

A view common among computational linguists is that the information theoretic methods used to construct robust NLP systems are engineering tools. On this approach, language engineering does not bear directly on the cognitive properties of grammar and human language processing. By contrast, Shieber (2004, p.c.) suggests that the success of information theoretic methods in NLP may have implications for the scientific understanding of natural language. Pereira (2000) argues that current work on grammar induction has revived Harris' (1951), (1991) program for combining formal grammar and information theory in the study of language.

Most machine learning work in NLP has used supervised learning techniques. These have limited implications for theories of human language learning, given that they require annotation of the training data with the structures and rules that are to be learned. However, there has recently been an increasing amount of promising research on unsupervised machine learning of linguistic knowledge. The results of this research suggest the computational viability of the view that general cognitive learning and projection mechanisms, rather than a richly articulated language faculty, may be sufficient to support language acquisition and interpretation.
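The contrast can be made concrete with a minimal sketch (my illustration, not an example from the paper): a supervised grammar learner receives sentences already annotated with the trees whose rules it must extract, while an unsupervised learner receives only the raw strings. The data and helper names below (treebank, count_rules, raw_corpus) are invented for the purpose.

```python
from collections import Counter

# Supervised setting: each training item pairs a sentence with its parse
# tree, so the rules to be learned are annotated directly in the input.
# Trees are nested tuples (label, child, ...); leaves are plain strings.
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "he"), ("VP", ("V", "runs"))),
]

def count_rules(tree, counts):
    """Count the context-free productions occurring in an annotated tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)

counts = Counter()
for tree in treebank:
    count_rules(tree, counts)
print(counts)  # relative frequencies of these counts give a simple PCFG

# Unsupervised setting: the same sentences, with no annotation at all.
# The structure itself must now be induced, e.g. by expectation-
# maximization over all possible bracketings (the inside-outside method).
raw_corpus = ["she sleeps", "he runs"]
```

The supervised learner can read its grammar off the annotation; the unsupervised learner must recover structure from distributional evidence alone, which is why its success bears more directly on the acquisition debate.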
In Section 2 I briefly consider the poverty of stimulus argument that has been traditionally invoked to motivate a distinct language faculty. Section 3 reviews major developments in the application of supervised machine learning to different areas of NLP. Section 4 considers some recent work in unsupervised learning of NLP systems. In Section 5 I attempt to clarify the central issues in the debate between the distinct language faculty and generalized learning model approaches. Finally, Section 6 states the main conclusions of this discussion and suggests directions for future work.

2 The Poverty of Stimulus Argument for a Language Faculty

Chomsky (1957), (1965), (1981), (1986), (1995), (2000) argues for a richly articulated language acquisition device to account for the speed and efficiency with which humans acquire natural language on the basis of "sparse" evidence. This device defines the initial state of the language learner. Its structures and constraints represent a Universal Grammar (UG) that the learner brings to the language acquisition problem. In earlier versions of the language faculty hypothesis (as in Chomsky (1965), for example) this device is presented as a schema that defines the set of possible grammars and an evaluation metric that ranks grammars conforming to this schema for a particular natural language. Since Chomsky (1981), UG has been described as an intricate set of principles containing parameters at significant points in their specification. On the Principles and Parameters (P&P) approach, exposure to a very limited amount of linguistic data allows the learner to determine parameter values in order to produce a grammar for his/her language. Chomsky (2000) (p. 8) describes this Principles and Parameters model in the following terms.

    We can think of the initial state of the faculty of language as a fixed
    network connected to a switch box; the network is constituted of the
    principles of language, while the switches are options to be determined
    by experience. When switches are set one way, we have Swahili; when they
    are set another way, we have Japanese. Each possible human language is
    identified as a particular setting of the switches—a setting of
    parameters, in technical terminology. If the research program succeeds,
    we should be able literally to deduce Swahili from one choice of
    settings, Japanese from another, and so on through the languages that
    humans acquire. The empirical conditions of language acquisition require
    that the switches can be set on the basis of the very limited properties
    of information that is available to the child.
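The switch-box metaphor can be rendered as a toy program (my sketch, not a formalism from the paper or from Chomsky). The parameters used here (head directionality, pro-drop, wh-movement) are stock examples from the P&P literature; the mapping from switches to surface properties is deliberately simplistic.

```python
# Toy "switch box": the principles are fixed code; only the binary
# parameter settings, fixed from experience, differ across languages.

def grammar(head_initial: bool, pro_drop: bool, wh_movement: bool) -> dict:
    """Map one setting of the switches to a few surface properties."""
    return {
        "verb_object_order": "V O" if head_initial else "O V",
        "overt_subject": "optional" if pro_drop else "obligatory",
        "wh_phrase": "fronted" if wh_movement else "in situ",
    }

# "When switches are set one way, we have Swahili; when they are set
# another way, we have Japanese."
english_like = grammar(head_initial=True, pro_drop=False, wh_movement=True)
japanese_like = grammar(head_initial=False, pro_drop=True, wh_movement=False)
print(english_like)
print(japanese_like)
```

On this picture acquisition reduces to fixing a few bits from experience rather than learning rules outright, which is how the P&P model reconciles rich grammatical knowledge with very limited input.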
The language faculty view relies primarily on the poverty of stimulus argument to motivate the claim that a powerful task-specific mechanism is required for language acquisition. According to this argument, the complex grammar that a child achieves within a short period, on the basis of very limited data, cannot be explained through general learning procedures of the kind involved in other cognitive tasks. This is, at root, a "What else could it be?" argument. It asserts that, given the complexity of grammatical knowledge and the lack of evidence available for discerning its properties, we are forced to the conclusion that much of this knowledge is not learned at all but must already be present in the initial design of the language learner.