INTERNATIONAL WORKSHOP
Adaptation of Language Resources and Technology to New Domains (AdaptLRTtoND)

held in conjunction with the International Conference RANLP - 2009, 14-16 September 2009, Borovets, Bulgaria

PROCEEDINGS

Edited by Núria Bel, Erhard Hinrichs, Kiril Simov and Petya Osenova

The Workshop is endorsed by FLaReNet Project http://www.flarenet.eu/
The Workshop is related to CLARIN Project http://www.clarin.eu/

Borovets, Bulgaria
17 September 2009

ISBN 978-954-452-009-0
Designed and Printed by INCOMA Ltd., Shoumen, Bulgaria

Workshop Organisers

Núria Bel, Pompeu Fabra University, Spain
Erhard Hinrichs, Tuebingen University, Germany (co-chair)
Petya Osenova, Bulgarian Academy of Sciences and Sofia University, Bulgaria
Kiril Simov, Bulgarian Academy of Sciences, Bulgaria (co-chair)

Workshop Programme Committee

Núria Bel, Pompeu Fabra University
Gosse Bouma, Groningen University
António Branco, Lisbon University
Walter Daelemans, Antwerp University
Markus Dickinson, Indiana University
Erhard Hinrichs, Tübingen University
Josef van Genabith, Dublin City University
Iryna Gurevych, Technische Universität Darmstadt - UKP Lab
Atanas Kiryakov, Ontotext AD
Vladislav Kubon, Charles University
Sandra Kübler, Indiana University
Lothar Lemnitzer, DWDS, Berlin-Brandenburgische Akademie der Wissenschaften
Bernardo Magnini, FBK
Detmar Meurers, Tübingen University
Paola Monachesi, Utrecht University
Preslav Nakov, National University of Singapore
John Nerbonne, Groningen University
Petya Osenova, Bulgarian Academy of Sciences and Sofia University
Gabor Proszeky, MorphoLogic
Adam Przepiorkowski, Polish Academy of Sciences
Marta Sabou, Open University - UK
Kiril Simov, Bulgarian Academy of Sciences
Cristina Vertan, Hamburg University

Table of Contents

Exploiting the Russian National Corpus in the Development of a Russian Resource Grammar
    Tania Avgustinova and Yi Zhang
Maximal Phrases Based Analysis for Prototyping Online Discussion Forums Postings
    Gaston Burek and Dale Gerdemann
LEXIE – an Experiment in Lexical Information Extraction
    John J. Camilleri and Michael Rosner
Adapting NLP and Corpus Analysis Techniques to Structured Imagery Analysis in Classical Chinese Poetry
    Alex Chengyu Fang, Fengju Lo and Cheuk Kit Chinn
Cross-lingual Adaptation as a Baseline: Adapting Maximum Entropy Models to Bulgarian
    Georgi Georgiev, Preslav Nakov, Petya Osenova and Kiril Simov
Mix and Match Replacement Rules
    Dale Gerdemann
Enabling Adaptation of Lexicalised Grammars to New Domains
    Valia Kordoni and Yi Zhang
QALL-ME needs AIR: a portability study
    Constantin Orăsan, Iustin Dornescu and Natalia Ponomareva
Personal Health Information Leak Prevention in Heterogeneous Texts
    Marina Sokolova, Khaled El Emam, Sean Rose, Sadrul Chowdhury, Emilio Neri, Elizabeth Jonker and Liam Peyton

Conference Program

Session 1
09:20-09:30 Opening
09:30-10:30 TBA
    Atanas Kiryakov (Invited Speaker), OntoText AD, Sofia, Bulgaria
10:30-11:00 Enabling Adaptation of Lexicalised Grammars to New Domains
    Valia Kordoni and Yi Zhang

Session 2
11:30-12:00 Personal Health Information Leak Prevention in Heterogeneous Texts
    Marina Sokolova, Khaled El Emam, Sean Rose, Sadrul Chowdhury, Emilio Neri, Elizabeth Jonker and Liam Peyton
12:00-12:30 Mix and Match Replacement Rules
    Dale Gerdemann

Session 3
14:00-14:30 Cross-lingual Adaptation as a Baseline: Adapting Maximum Entropy Models to Bulgarian
    Georgi Georgiev, Preslav Nakov, Petya Osenova and Kiril Simov
14:30-15:00 Exploiting the Russian National Corpus in the Development of a Russian Resource Grammar
    Tania Avgustinova and Yi Zhang
15:00-15:30 LEXIE - an Experiment in Lexical Information Extraction
    John J.
Camilleri and Michael Rosner
16:00-16:30 Maximal Phrases Based Analysis for Prototyping Online Discussion Forums Postings
    Gaston Burek and Dale Gerdemann
16:30-17:00 QALL-ME needs AIR: a portability study
    Constantin Orăsan, Iustin Dornescu and Natalia Ponomareva
17:00-17:10 Closing Remarks

Exploiting the Russian National Corpus in the Development of a Russian Resource Grammar

Tania Avgustinova
DFKI GmbH & Saarland University
P.O. Box 151150, 66041 Saarbrücken, Germany

Yi Zhang
DFKI GmbH & Saarland University
P.O. Box 151150, 66041 Saarbrücken, Germany

Workshop Adaptation of Language Resources and Technology to New Domains 2009 - Borovets, Bulgaria, pages 1–11

Abstract

In this paper we present the on-going grammar engineering project in our group for developing in parallel resource precision grammars for Slavic languages. The project utilizes DELPH-IN software (LKB/[incr tsdb()]) as the grammar development platform, and has strong affinity to the LinGO Grammar Matrix project. It is innovative in that we focus on a closed set of related but extremely diverse languages. The goal is to encode mutually interoperable analyses of a wide variety of linguistic phenomena, taking into account eminent typological commonalities and systematic differences. As one major objective of the project, we aim to develop a core Slavic grammar whose components can be commonly shared among the set of languages, and facilitate new grammar development. As a showcase, we discuss a small HPSG grammar for Russian. The interesting bit of this grammar is that the development is assisted by interfacing with existing corpora and processing tools for the language, which saves a significant amount of engineering effort.

Keywords

Parallel grammar engineering, corpora, Slavic languages

1. Introduction

Our long-term goal is to develop grammatical resources for Slavic languages and to make them freely available for the purposes of research, teaching and natural language applications. As one major objective of the project, we aim to develop and implement a core Slavic grammar whose components can be commonly shared among the set of languages, and facilitate new grammar development. A decision on the proper setup, along with a commitment to a reliable infrastructure right from the beginning, is essential for such an endeavor because the implementation of linguistically-informed grammars for natural languages draws on a combination of engineering skills, sound grammatical theory, and software development tools.

1.1 DELPH-IN initiative

Current international collaborative efforts on deep linguistic processing with Head-driven Phrase Structure Grammar [1-3] exploit the notion of shared grammar for the rapid development of grammars for new languages and for the systematic adaptation of grammars to variants of the same language. This international partnership, which became popular under the name DELPH-IN¹, is based on a shared commitment to re-usable, multi-purpose resources and active exchange. Its leading idea is to combine linguistic and statistical processing methods for getting at the meaning of texts and utterances. Based on contributions from several member institutions and joint development over many years, an open-source repository of software and linguistic resources has been created that already enjoys wide usage in education, research, and application building.

In accord with the DELPH-IN community we view rule-based precision grammars as linguistically-motivated resources designed to model human languages as accurately as possible. Unlike statistical grammars, these systems are hand-built by grammar engineers, taking into account the engineer's theory and analysis for how to best represent various syntactic and semantic phenomena in the language of interest. A side effect of this, however, is that such grammars tend to be substantially different from each other, with no best practices or common representations.² As implementations evolved for several languages within the same common formalism, it became clear that the homogeneity among existing grammars could be increased and development cost for new grammars greatly reduced by compiling an inventory of cross-linguistically valid (or at least useful) types and constructions. To speed up and simplify the grammar development as well as provide a common framework, making the resulting grammars more comparable, the LinGO³ Grammar Matrix⁴ has been set up as a multi-lingual grammar engineering project [4] which provides a web-based tool designed to support the creation of linguistically-motivated grammatical resources in the framework of HPSG [5].

¹ Deep Linguistic Processing with HPSG Initiative (DELPH-IN), URL: http://www.delph-in.net/
² Exceptions do exist, of course: the ParGram (Parallel Grammar) project is one example of multiple grammars developed using a common standard. It aims at producing wide coverage grammars for a wide variety of languages. These are written collaboratively within the linguistic framework of Lexical Functional Grammar (LFG) and with a commonly-agreed-upon set of grammatical features. URL: http://www2.parc.com/isl/groups/nltt/pargram/

The Grammar Matrix is written in the TDL (type description language) formalism, which is interpreted by the LKB⁵ grammar development environment [6]. It is

phenomena are brought into the grammar [13]. The original Grammar Matrix consisted of types defining the basic feature geometry, e.g. [14], types for lexical and syntactic rules encoding the ways that heads combine with arguments and adjuncts, and configuration files for the LKB grammar development environment [6] and the PET system [12]. Subsequent releases have refined the original types and