Trans-European Language Resources Infrastructure) European Seminar (1St, Tihany, Hungary, September 15-16


DOCUMENT RESUME

ED 413 738    FL 024 759

AUTHOR Rettig, Heike, Ed.
TITLE Language Resources for Language Technology: Proceedings of the TELRI (Trans-European Language Resources Infrastructure) European Seminar (1st, Tihany, Hungary, September 15-16, 1995).
INSTITUTION Institut fuer deutsche Sprache, Mannheim (Germany).
ISBN ISBN-963-8461-99-3
PUB DATE 1995-00-00
NOTE 189p.; For individual articles, see FL 024 760-778. "In collaboration with Julia Pajzs and Gabor Kiss."
PUB TYPE Collected Works - Proceedings (021)
EDRS PRICE MF01/PC08 Plus Postage.
DESCRIPTORS *Computational Linguistics; Computer Software; *Computer Software Development; Contrastive Linguistics; Czech; Data Processing; Dictionaries; *Discourse Analysis; Dutch; English; Foreign Countries; Information Technology; *Language Planning; Language Research; *Languages; Languages for Special Purposes; Linguistic Theory; Machine Translation; Morphology (Languages); Research Methodology; Russian; Shared Resources and Services; Slovenian; Spelling; Structural Analysis (Linguistics); Suprasegmentals; Uncommonly Taught Languages; Vocabulary
IDENTIFIERS Speech Recognition
ABSTRACT
This proceedings volume contains papers from the first European seminar of the Trans-European Language Resources Infrastructure (TELRI), including: "Cooperation with Central and Eastern Europe in Language Engineering" (Poul Andersen); "Language Technology and Language Resources in China" (Feng Zhiwei); "Public Domain Generic Tools: An Overview" (Tomaz Erjavec); "The 'Terminology Market'" (Christian Galinski); "Lexical Resources and Their Application" (Martin Gellerstam); "Encoding Standards for Linguistic Corpora" (Nancy Ide); "Machine Translation: State of the Art, Trends and the User Perspective" (Steven Krauwer); "MULTEXT-EAST: Multilingual Text Tools and Corpora for Central and Eastern European Languages" (Erjavec, Ide, Vladimir Petkevic, Jean Veronis); "Speech Recognition: A General Overview" (Luis de Sopena); "Language Resources: The Foundations of a Pan-European Information Society" (Wolfgang Teubert); "Rail-Lex Slovenia--A Modern Railway Dictionary" (Primoz Jakopin); "A New Dutch Spelling Guide" (J. G. Kruyt, P. G. J. van Sterkenburg); "European Language Resources and the Treasury of the Computerised Russian Language Fund" (Elena Paskaleva); "HUMOR--A Morphological System for Corpus Analysis" (Gabor Proszeky); "CORDON--A Joint Venture Case Study" (Norbert Volz); "EVA--A Textual Data Processing Tool" (Jakopin); "On-line Access to Linguistically Annotated Text Corpora" (Kruyt, S. A. Raaijmakers, P. H. J. van der Kamp, R. J. van Strien); "Tagging a Highly Inflected Language" (Paskaleva, Bojanka Zaharieva); and "A Simple Czech and English Probabilistic Tagger: A Comparison" (Barbora Hladka, Jan Hajic). (MSE)

U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)
This document has been reproduced as received from the person or organization originating it. Minor changes have been made to improve reproduction quality. Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.

PERMISSION TO REPRODUCE AND DISSEMINATE THIS MATERIAL HAS BEEN GRANTED BY [signature illegible] TO THE EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

TELRI
TRANS-EUROPEAN LANGUAGE RESOURCES INFRASTRUCTURE

PROCEEDINGS OF THE FIRST EUROPEAN SEMINAR
"Language Resources for Language Technology"
Tihany, Hungary
September 15 and 16, 1995

Edited by Heike Rettig
In Collaboration with Julia Pajzs and Gabor Kiss

ISBN 963 8461 99 3

Contents

PREFACE

P. ANDERSEN
    Cooperation with Central and Eastern Europe in Language Engineering
FENG ZHIWEI
    Language Technology and Language Resources in China
T. ERJAVEC
    Public Domain Generic Tools: An Overview
C. GALINSKI
    The 'Terminology Market'
M. GELLERSTAM
    Lexical Resources and Their Application
N. IDE
    Encoding Standards for Linguistic Corpora
S. KRAUWER
    Machine Translation: State of the Art, Trends and the User Perspective
T. ERJAVEC, N. IDE, V. PETKEVIC, J. VERONIS
    MULTEXT-EAST: Multilingual Text Tools and Corpora for Central and Eastern European Languages
L. DE SOPENA
    Speech Recognition: A General Overview
W. TEUBERT
    Language Resources: The Foundations of a Pan-European Information Society
P. JAKOPIN
    Rail-lex Slovenia--A Modern Railway Dictionary
J. G. KRUYT, P. G. J. VAN STERKENBURG
    A New Dutch Spelling Guide
E. PASKALEVA
    European Language Resources and the Treasury of the Computerised Russian Language Fund
G. PROSZEKY
    HUMOR--A Morphological System for Corpus Analysis
N. VOLZ
    CORDON--A Joint Venture Case Study
P. JAKOPIN
    EVA--A Textual Data Processing Tool
J. G. KRUYT, S. A. RAAIJMAKERS, P. H. J. VAN DER KAMP, R. J. VAN STRIEN
    On-line Access to Linguistically Annotated Text Corpora of Dutch via Internet
E. PASKALEVA, B. ZAHARIEVA
    Tagging a Highly Inflected Language
B. HLADKA, J. HAJIC
    A Simple Czech and English Probabilistic Tagger: A Comparison

Preface

This book documents the presentations given at the first TELRI Seminar, "Language Resources for Language Technology", held in Tihany, Hungary, on September 15-16, 1995. TELRI (Trans-European Language Resources Infrastructure) is an international project funded by the European Commission. It brings together 22 institutions from 17 European countries. TELRI will set up a permanent network of leading national language and language technology centers and will pool existing language resources and generic software tools.

Language technology needs language resources, such as corpora of spoken and written language, word lists, lexicons, and machine-readable dictionaries, together with tools for extracting linguistic knowledge, in order to develop and optimise products. Since high-quality language resources of this kind are often available in the domain of public research, one aim of the first TELRI Seminar was to provide a platform where researchers in corpus linguistics and natural language processing, lingware developers, and end-users of language resources can meet and discuss new ideas and possibilities for co-operation.
The TELRI Seminar provided information on the existing and emerging European language technology infrastructure, showed the state of the art in relevant fields of natural language processing, demonstrated joint venture projects between (public) research and (private) industry, and gave participants the opportunity to see computer demonstrations featuring resources, tools, and products in the field of language technology. These contributions came from researchers in the university domain (TELRI members, members of other international projects, members of research institutes) as well as from representatives of private companies.

In keeping with the challenges of a multilingual Europe, the seminar tried to establish an international perspective, reflected in the topics of the presentations and in the speakers and participants themselves, who came from twenty-four different countries. Since TELRI includes numerous members from Central and Eastern Europe, the seminar also offered the opportunity to gain information on the language resources of languages that are becoming more relevant in the growing and increasingly differentiated language technology market.

The collection of articles from the TELRI Seminar illustrates the wide range of activities and topics in the domain of language resources and language technology. Mirroring the structure of the seminar, the volume contains different types of contributions, which cover the following aspects:

- Information on trends and developments concerning fundamental topics, e.g. terminology, lexical resources, machine translation, public domain generic tools, and speech recognition.
- An overview of developments in language technology and language resources in particular regions (Europe and China).
- Information on international standardisation activities for corpora (the "Text Encoding Initiative").
- Information on activities concerning multilingual text tools and corpora for Central and Eastern European languages (the "MULTEXT-EAST" project).
- Information on activities (Concerted Actions and Joint Research projects) which are supported by the European Commission.
- Presentation of concrete (joint-venture) projects.
- Description of computer demonstrations given at the seminar.

The seminar was the first of its kind organised by TELRI, and we think that we can count it as a success. The presentations and demonstrations provided a broad foundation for discussions in the plenary sessions as well as for direct communication and contacts. Two further TELRI Seminars are planned within the next two years, and we think that this first one was a good beginning.

We hope that this presentation will be a useful source of information and, moreover, that it will encourage engagement in co-operation activities and stimulate new ideas for projects and products in the field of language processing.

I would like to thank Lucie Piro, Joyce Thompson, and Norbert Volz for their logistical and editorial assistance.

Heike Rettig

Cooperation with Central and Eastern Europe in Language