DOCUMENT RESUME

ED 413 738    FL 024 759

AUTHOR Rettig, Heike, Ed.
TITLE Language Resources for Language Technology: Proceedings of the TELRI (Trans-European Language Resources Infrastructure) European Seminar (1st, Tihany, Hungary, September 15-16, 1995).
INSTITUTION Institut fuer deutsche Sprache, Mannheim (Germany).
ISBN ISBN-963-8461-99-3
PUB DATE 1995-00-00
NOTE 189p.; For individual articles, see FL 024 760-778. "In collaboration with Julia Pajzs and Gabor Kiss."
PUB TYPE Collected Works - Proceedings (021)
EDRS PRICE MF01/PC08 Plus Postage.
DESCRIPTORS *Computational Linguistics; Computer Software; *Computer Software Development; Contrastive Linguistics; Czech; Data Processing; Dictionaries; *Discourse Analysis; Dutch; English; Foreign Countries; Information Technology; *Language Planning; Language Research; *Languages; Languages for Special Purposes; Linguistic Theory; Machine Translation; Morphology (Languages); Research Methodology; Russian; Shared Resources and Services; Slovenian; Spelling; Structural Analysis (Linguistics); Suprasegmentals; Uncommonly Taught Languages; Vocabulary
IDENTIFIERS Speech Recognition

ABSTRACT
These proceedings of the first European seminar of the Trans-European Language Resources Infrastructure (TELRI) include the following papers: "Cooperation with Central and Eastern Europe in Language Engineering" (Poul Andersen); "Language Technology and Language Resources in China" (Feng Zhiwei); "Public Domain Generic Tools: An Overview" (Tomaz Erjavec); "The 'Terminology Market'" (Christian Galinski); "Lexical Resources and Their Application" (Martin Gellerstam); "Encoding Standards for Linguistic Corpora" (Nancy Ide); "Machine Translation: State of the Art, Trends and the User Perspective" (Steven Krauwer); "MULTEXT-EAST: Multilingual Text Tools and Corpora for Central and Eastern European Languages" (Erjavec, Ide, Vladimir Petkevic, Jean Veronis); "Speech Recognition: A General Overview" (Luis de Sopena); "Language Resources: The Foundations of a Pan-European Information Society" (Wolfgang Teubert); "Rail-Lex Slovenia--A Modern Railway Dictionary" (Primoz Jakopin); "A New Dutch Spelling Guide" (J. G. Kruyt, P. G. J. van Sterkenburg); "European Language Resources and the Treasury of the Computerised Russian Language Fund" (Elena Paskaleva); "HUMOR--A Morphological System for Corpus Analysis" (Gabor Proszeky); "CORDON--A Joint Venture Case Study" (Norbert Volz); "EVA--A Textual Data Processing Tool" (Jakopin); "On-line Access to Linguistically Annotated Text Corpora" (Kruyt, S. A. Raaijmakers, P. H. J. van der Kamp, R. J. van Strien); "Tagging a Highly Inflected Language" (Paskaleva, Bojanka Zaharieva); and "A Simple Czech and English Probabilistic Tagger: A Comparison" (Barbora Hladka, Jan Hajic). (MSE)

U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)
This document has been reproduced as received from the person or organization originating it. Minor changes have been made to improve reproduction quality. Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.

PERMISSION TO REPRODUCE AND DISSEMINATE THIS MATERIAL HAS BEEN GRANTED BY [signature illegible] TO THE EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

PROCEEDINGS OF THE FIRST TELRI SEMINAR IN TIHANY

TELRI
TRANS-EUROPEAN LANGUAGE RESOURCES INFRASTRUCTURE

PROCEEDINGS OF THE FIRST EUROPEAN SEMINAR
"Language Resources for Language Technology"
Tihany, Hungary
September 15 and 16, 1995

Edited by Heike Rettig
In Collaboration with Julia Pajzs and Gabor Kiss

ISBN 963 8461 99 3

Contents

PREFACE
P. ANDERSEN  Cooperation with Central and Eastern Europe in Language Engineering
FENG ZHIWEI  Language Technology and Language Resources in China
T. ERJAVEC  Public Domain Generic Tools: An Overview
C. GALINSKI  The 'Terminology Market'
M. GELLERSTAM  Lexical Resources and Their Application
N. IDE  Encoding Standards for Linguistic Corpora
S. KRAUWER  Machine Translation: State of the Art, Trends and the User Perspective
T. ERJAVEC, N. IDE, V. PETKEVIC, J. VERONIS  MULTEXT-EAST: Multilingual Text Tools and Corpora for Central and Eastern European Languages
L. DE SOPENA  Speech Recognition: A General Overview
W. TEUBERT  Language Resources: The Foundations of a Pan-European Information Society
P. JAKOPIN  Rail-lex Slovenia--A Modern Railway Dictionary
J. G. KRUYT, P. G. J. VAN STERKENBURG  A New Dutch Spelling Guide
E. PASKALEVA  European Language Resources and the Treasury of the Computerised Russian Language Fund
G. PROSZEKY  HUMOR--A Morphological System for Corpus Analysis
N. VOLZ  CORDON--A Joint Venture Case Study
P. JAKOPIN  EVA--A Textual Data Processing Tool
J. G. KRUYT, S. A. RAAIJMAKERS, P. H. J. VAN DER KAMP, R. J. VAN STRIEN  On-line Access to Linguistically Annotated Text Corpora of Dutch via Internet
E. PASKALEVA, B. ZAHARIEVA  Tagging a Highly Inflected Language
B. HLADKA, J. HAJIC  A Simple Czech and English Probabilistic Tagger: A Comparison

Preface

This book documents the presentations given at the first TELRI Seminar, "Language Resources for Language Technology", held in Tihany, Hungary, on September 15 and 16, 1995. TELRI (Trans-European Language Resources Infrastructure) is an international project funded by the European Commission. It brings together 22 institutions from 17 European countries. TELRI will set up a permanent network of leading national language and language technology centers and will pool existing language resources and generic software tools.

Language technology needs language resources such as corpora of spoken and written language, word lists, lexicons, machine-readable dictionaries, and tools to extract linguistic knowledge in order to develop and optimise products. Since such language resources are often available in high quality in public research, one aim of the first TELRI Seminar was to provide a platform where researchers in the field of corpus linguistics and natural language processing, lingware developers, and end-users of language resources can meet and discuss new ideas and possibilities for co-operation.

The TELRI Seminar provided information on the existing and emerging European language technology infrastructure, showed the state of the art in relevant fields of natural language processing, demonstrated joint venture projects between (public) research and (private) industry, and gave participants the opportunity to see computer demonstrations featuring resources, tools, and products in the field of language technology. These contributions came from researchers in the university domain (TELRI members, members of other international projects, members of research institutes) as well as from representatives of private companies.

In keeping with the challenges of a multilingual Europe, the seminar tried to establish an international perspective, reflected by the topics of the presentations and by the speakers and participants themselves, who came from twenty-four different countries. Since TELRI includes numerous members from Central and Eastern Europe, this seminar also offered the opportunity to gain information on the language resources of languages which are becoming more relevant in the growing and increasingly differentiated language technology market.

The collection of articles from the TELRI Seminar illustrates that there is a wide range of activities and topics in the domain of language resources and language technology. Following the structure of the seminar, different types of contributions are to be found here, covering the following aspects:

- Information on trends and developments concerning fundamental topics, e.g. terminology, lexical resources, machine translation, public domain generic tools, and speech recognition.
- An overview of the developments concerning language technology and language resources in particular regions (Europe and China).
- Information on international standardisation activities for corpora (the "Text Encoding Initiative").
- Information on activities concerning multilingual text tools and corpora for Central and Eastern European languages (the "MULTEXT-EAST" project).
- Information on activities (Concerted Actions and Joint Research projects) which are supported by the European Commission.
- Presentation of concrete (joint-venture) projects.
- Description of computer demonstrations given at the seminar.

The seminar was the first of its kind organised by TELRI, and we think that we can count it as a success. The presentations and demonstrations provided a broad foundation for discussions in the plenary sessions as well as for direct communication and contacts. Two other TELRI Seminars are planned within the next two years, and we think that this first one was a good beginning.

We hope that this presentation will be a useful source of information and, moreover, that it will encourage engagement in co-operation activities and stimulate new ideas for projects and products in the field of language processing.

I would like to thank Lucie Piro, Joyce Thompson, and Norbert Volz for their logistical and editorial assistance.

Heike Rettig

Cooperation with Central and Eastern Europe in Language