Apertium-Icenlp: a Rule-Based Icelandic to English Machine Translation System

Total Page:16

File Type:pdf, Size:1020Kb

Apertium-Icenlp: a Rule-Based Icelandic to English Machine Translation System Apertium-IceNLP: A rule-based Icelandic to English machine translation system Martha Dís Brandt, Hrafn Loftsson, Francis M. Tyers Hlynur Sigurþórsson School of Computer Science Dept. Lleng. i. Sist. Inform. Reykjavik University Universitat d’Alacant IS-101 Reykjavik, Iceland E-03071 Alacant, Spain {marthab08,hrafn,hlynurs06}@ru.is [email protected] Abstract translate between the source language (SL) and the target language (TL). We describe the development of a pro- SMT has many advantages, e.g. it is data-driven, totype of an open source rule-based language independent, does not need linguistic ex- Icelandic English MT system, based on perts, and prototypes of new systems can by built → the Apertium MT framework and IceNLP, quickly and at a low cost. On the other hand, the a natural language processing toolkit for need for parallel corpora as training data in SMT Icelandic. Our system, Apertium-IceNLP, is also its main disadvantage, because such corpora is the first system in which the whole are not available for a myriad of languages, espe- morphological and tagging component of cially the so-called less-resourced languages, i.e. Apertium is replaced by modules from languages for which few, if any, natural language an external system. Evaluation shows processing (NLP) resources are available. When that the word error rate and the position- there is a lack of parallel corpora, other machine independent word error rate for our pro- translation (MT) methods, such as rule-based MT, totype is 50.6% and 40.8%, respectively. e.g. Apertium (Forcada et al., 2009), may be used As expected, this is higher than the corre- to create MT systems. sponding error rates in two publicly avail- In this paper, we describe the development able MT systems that we used for com- of a prototype of an open source rule-based parison. Contrary to our expectations, the Icelandic English (is-en) MT system based on → error rates of our prototype is also higher Apertium and IceNLP, an NLP toolkit for process- than the error rates of a comparable system ing and analysing Icelandic texts (Loftsson and based solely on Apertium modules. Based Rögnvaldsson, 2007b). A decade ago, the Ice- on error analysis, we conclude that bet- landic language could have been categorised as ter translation quality may be achieved by a less-resourced language. The current situation, replacing only the tagging component of however, is much better thanks to the development Apertium with the corresponding module of IceNLP and various linguistic resources (Rögn- in IceNLP, but leaving morphological anal- valdsson et al., 2009). On the other hand, no large ysis to Apertium. parallel corpus, in which Icelandic is one of the languages, is freely available. This is the main rea- 1 Introduction son why the work described here was initiated. Our system, Apertium-IceNLP, is the first sys- Over the last decade or two, statistical machine tem in which the whole morphological and tagg- translation (SMT) has gained significant momen- ing component of Apertium is replaced by mod- tum and success, both in academia and industry. ules from an external system. Our motivation SMT uses large parallel corpora, texts that are for developing such a hybrid system was to be translations of each other, during training to derive able to answer the following research question: Is a statistical translation model which is then used to the translation quality of an is-en shallow-transfer c 2011 European Association for Machine Translation. MT system higher when using state-of-the-art Ice- Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste (eds.) Proceedings of the 15th Conference of the European Association for Machine Translation, p. 217224 Leuven, Belgium, May 2011 landic NLP modules in the Apertium pipeline as morphologically analysed words, chooses the opposed to relying solely on Apertium modules? most likely sequence of PoS tags. Evaluation results show that the word error rate Lexical selection: A lexical selection mod- (WER) of our prototype is 50.6% and the position- • independent word error rate (PER) is 40.8%1. This ule based on Constraint Grammar (Karlsson is higher than the evaluation results of two publicly et al., 1995) selects between possible transla- available MT systems for is-en translation, Google tions of a word based on sentence context. 2 3 Translate and Tungutorg . This was expected, Lexical transfer: For an unambiguous lex- • given the short development time of our system, ical form in the SL, this module returns the i.e. 8 man-months. For comparison, we know that equivalent TL form based on a bilingual dic- Tungutorg has been developed by an individual, tionary. Stefán Briem, intermittently over a period of two decades4. Structural transfer: Performs local morpho- • Contrary to our expectations, the error rates of logical and syntactic changes to convert the our hybrid system is also higher than the error rates SL into the TL. of an is-en system based solely on Apertium mod- A morphological generator: For a given TL ules. This “pure” Apertium version was devel- • oped in parallel with Apertium-IceNLP. Based on lexical form, this module returns the TL sur- our error analysis, we conclude that better trans- face form. lation quality may be achieved by replacing only 2.1 Language pair specifics the tagging component of Apertium with the corre- sponding module in IceNLP, but leaving morpho- For each language pair, the Apertium platform logical analysis to Apertium. needs a monolingual SL dictionary used by the We think that our work can be viewed as a morphological analyser, a bilingual SL-TL dictio- guideline for other researchers wanting to develop nary used by the lexical transfer module, a mono- hybrid MT systems based on Apertium. lingual TL dictionary used by the morphological generator, and transfer rules used by the struc- 2 Apertium tural transfer module. The dictionaries and transfer rules specific to the is-en pair will be discussed in The Apertium shallow-transfer MT platform was Sections 4.2 and 4.3, respectively. originally aimed at the Romance languages of the The lexical selection module is a new module Iberian peninsula, but has also been adapted for in the Apertium platform and the is-en pair is the other languages, e.g. Welsh (Tyers and Don- first released pair to make extensive use of it. The nelly, 2009) and Scandinavian languages (Nord- module works by selecting a translation based on falk, 2009). The whole platform, both programs sentence context. For example, for the ambigu- and data, is free and open source and all the soft- ous word bóndi ’farmer’ or ’husband’, the default ware and data for the supported language pairs is 5 translation is left as ’farmer’, but a lexical selec- available for download from the project website . tion rule chooses the translation of ’husband’ if The Apertium platform consists of the following a possessive pronoun is modifying it. While the main modules: current lexical selection rules have been written by A morphological analyser: Performs token- hand, work is ongoing to generate them automati- • isation and morphological analysis which for cally with machine learning techniques. a given surface form returns all of the possible 3 IceNLP lexical forms (analyses) of the word. IceNLP is an open source6 NLP toolkit for pro- A part-of-speech (PoS) tagger: The HMM- • cessing and analysing Icelandic texts. Currently, based PoS tagger, given a sequence of the main modules of IceNLP are the following: 1See explanations of WER and PER in Section 5. 2http://translate.google.com A tokeniser. This module performs both 3 • http://www.tungutorg.is/ word tokenisation and sentence segmenta- 4 We do not have information on development months for the tion. is-en part of Google Translate. 5http://www.apertium.org 6http://icenlp.sourceforge.net 218 IceMorphy: A morphological analyser Wallenberg, 2008; Loftsson et al., 2009) has • (Loftsson, 2008). The program provides the shown that the morphological complexity of the tag profile (the ambiguity class) for known Icelandic language, and the relatively small train- words by looking up words in its dictionary. ing corpus in relation to the size of the tagset, is to The dictionary is derived from the Icelandic blame for a rather low tagging accuracy (compared Frequency Dictionary (IFD) corpus (Pind et to related languages). Taggers that are purely al., 1991). The tag profile for unknown based on machine learning (including HMM tri- words, i.e. words not known to the dictionary, gram taggers) have not been able to produce high is guessed by applying rules based on mor- accuracy when tagging Icelandic text (with the ex- phological suffixes and endings. IceMorphy ception of Dredze and Wallenberg (2008)). The does not generate word forms, it only carries current state-of-the-art tagging accuracy of 92.5% out analysis. is obtained by applying a hybrid approach, inte- grating TriTagger into IceTagger (Loftsson et al., IceTagger: A linguistic rule-based PoS • 2009). tagger (Loftsson, 2008). The tagger produces disambiguated morphosyntactic tags from the 4 Apertium-IceNLP tagset of the IFD corpus. The tagger uses Ice- Morphy for morphological analysis and ap- We decided to experiment with using IceMorphy, plies both local rules and heuristics for dis- Lemmald, IceTagger and IceParser in the Aper- ambiguation. tium pipeline. Note that since Apertium is based on a collection of modules that are connected by TriTagger: A statistical PoS tagger. This • clean interfaces in a pipeline (following the Unix trigram tagger is a re-implemenation of philosophy (Forcada et al., 2009)) it is relatively the well-known HMM tagger described by easy to replace modules or add new ones. Figure 1 Brants (2000). It is trained on the IFD cor- shows the Apertium-IceNLP pipeline.
Recommended publications
  • "Machine Translation Evaluation Through Post-Editing…"
    ©inTRAlinea & Anna Fernández Torné (2016). "Machine translation evaluation through post-editing measures in audio description", inTRAlinea Vol. 18. Permanent URL: http://www.intralinea.org/archive/article/2200 inTRAlinea [ISSN 1827-000X] is the online translation journal of the Department of Interpreting and Translation (DIT) of the University of Bologna, Italy. This printout was generated directly from the online version of this article and can be freely distributed under the following Creative Commons License. Machine translation evaluation through post-editing measures in audio description By Anna Fernández Torné (Universitat Autònoma de Barcelona, Spain) Abstract & Keywords English: The number of accessible audiovisual products and the pace at which audiovisual content is made accessible need to be increased, reducing costs whenever possible. The implementation of different technologies which are already available in the translation field, specifically machine translation technologies, could help reach this goal in audio description for the blind and partially sighted. Measuring machine translation quality is essential when selecting the most appropriate machine translation engine to be implemented in the audio description field for the English-Catalan language combination. Automatic metrics and human assessments are often used for this purpose in any specific domain and language pair. This article proposes a methodology based on both objective and subjective measures for the evaluation of five different and free online machine translation systems. Their raw machine translation outputs and the post-editing effort that is involved are assessed using eight different scores. Results show that there are clear quality differences among the systems assessed and that one of them is the best rated in six out of the eight evaluation measures used.
    [Show full text]
  • A Finite-State Morphological Analyser for Sindhi
    A Finite-State Morphological Analyser for Sindhi Raveesh Motlani1, Francis M. Tyers2 and Dipti M. Sharma1 1FC Kohli Center on Intelligent Systems (KCIS), International Institute of Information Technology Hyderabad, Telangana, India 2 HSL-fakultehta, UiT Norgga árktalaš universitehta N-9019 Romsa [email protected], [email protected], [email protected] Abstract Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of free/open-source finite-state morphological analyser for Sindhi. We have used Apertium’s lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia, which is a freely-available large corpus of Sindhi and achieved a reasonable coverage of about 81% and a precision of over 97%. Keywords: Sindhi, Morphological Analysis, Finite-State Machines [ɓ] ٻ [ɲ] ڃ [ŋ] ڱ Introduction .1 [ɠ] [ʄ] [bʱ] ڀ ڄ ڳ Morphology describes the internal structure of words in a [dʱ] ڌ [cʰ] ڇ [k] ڪ language. A morphological analysis of a word involves [ɗ] ڏ [ʈʰ] ٺ [ɳ] ڻ ,describing one or more of its properties such as: gender [ɖ] ڊ [ʈ] ٽ [pʰ] ڦ number, person, case, lexical category, etc. Morphological [ɖʱ] ڍ [tʰ] ٿ [ɽ] ڙ -analysis of a word thus becomes a fundamental and cru cial task in natural-language processing for any language.
    [Show full text]
  • Multi-Script Morphological Transducers and Transcribers for Seven Turkic
    Multi-script morphological transducers and transcribers for seven Turkic languages Jonathan Washington, Francis Tyers, Oğuzhan Kuyrukçu Swarthmore College, Indiana University / Высшая Школа Экономики, Boğaziçi Üniversitesi [email protected], [email protected], [email protected] This paper describes ongoing work to augment morphological transducers for seven Turkic languages with support for multiple scripts each, as well as respective IPA transcription systems. Evaluation demonstrates that our approach yields coverage equivalent to or not much lower than that of the base transducers. Background. A morphological transducer converts between form and analysis, e.g. алмалардан ↔ алма<n><pl> <abl>, where the form-to-analysis task is termed “morphological analysis” and the analysis-to-form task is termed “morphological generation”. Existing Free/Open-Source morphological transducers for Turkic languages (Wash- ington et al. 2020) are implemented in only one orthography, despite a number of the languages being currently written in two or more orthographies, or having a large body of text written in an orthography that was recently switched away from. This paper builds on work in which Cyrillic support was added to a transducer for Crimean Tatar which had been implemented in the Latin script (Tyers et al. 2019). We leverage morphological transducers for Kazakh (imple- mented in the Cyrillic script), Kyrgyz (Cyrillic), Turkmen (Latin), Qaraqalpaq (Latin), Uzbek (Latin), and Uyghur (Perso-Arabic), and add support for analysis and generation in additional scripts that are currently or have recently been used for the languages. Specifically, we add Cyrillic support to Turkmen, Qaraqalpaq, Uzbek, and Uyghur transducers; Perso-Arabic support to the Kazakh and Kyrgyz transducers; Latin script to the Kazakh and Uyghur transducers; and IPA support to all of them.
    [Show full text]
  • Developing Morph-Analyzer for Urdu Using Apertium: Some Issues
    Developing Morph-Analyzer for Urdu Using Apertium: Some issues Shahid Mushataq Bhat [email protected] Linguistic Data Consortium for Indian Languages (LDCIL) CIIL, Mysore Content: Introduction Morphological features of Urdu Apertium (LT-toolbox): Some background Computing Noun-morphology using LT Toolbox Split-Orthography of Urdu Conclusion Introduction: Automatic morphological analysis is the fundamental task in NLP that can be employed in enhancing the accuracy of POS-taggers, Chunkers, Parsers and Information retrieval systems. Computational morphology models the internal structure of words i-e the way; words are built out of minimal units called morphemes. Most of natural languages construct words by concatenating morphemes together in strict orders. Such Concatenative morphotactics is highly productive, particularly, in agglutinative languages like Tamil, Kannada, Manipuri, etc but in some languages like Hebrew and Arabic (Semitic languages) infixation is the main morphological operation (instead of concatenation), constituting Non-Concatenative (Templatic or Root & Pattern) morphology. Continues …… Beyond this Concatenative and Non-Concatenative polarity, Urdu nouns (unlike nouns of other Indian Languages) show the interplay of both types of morphologies. So, morphological structure of Urdu like Tagalog (a language of Philippines) can’t be computed adequately unless dual nature of its morphology is not taken into account. “The morphotactic limitations of the traditional implementations are the direct result of relying solely
    [Show full text]
  • Estudio Comparativo De Tres
    TRABAJO DE FIN DE GRADO FACULTAD DE CIENCIAS HUMANAS Y SOCIALES GRADO EN TRADUCCIÓN E INTERPRETACIÓN ESTUDIO COMPARATIVO DE TRES TRADUCTORES AUTOMÁTICOS EN LÍNEA : DEEP L, YANDEX Y APERTIUM Autora: Mónica Adán Soriano Directora: Profesora Mª Luisa Romana García Madrid, junio 2019 Resumen : Este trabajo tiene la finalidad de comparar traductores automáticos en línea para así determinar cuál es el traductor más avanzado para un texto técnico. Para ello, primero, habrá un análisis de la evolución histórica de la traducción automática y sus usos, así como los programas desarrollados para ello. Se explicarán además los diferentes sistemas de traducción automática que existen y cómo funcionan. En la parte experimental, se escogerá un texto de carácter técnico en español y de oraciones que supongan un reto para un traductor y se procesará en los tres traductores automáticos escogidos según su modalidad para realizar una traducción al inglés. Una vez obtenidas las respuestas, se analizarán los errores cometidos por los traductores automáticos, concluyendo así con los errores más comunes y el mejor traductor automático en línea así como los usos que se le pueden dar. Palabras clave: traducción automática, sistemas basados en estadística, Yandex, sistemas neuronales, DeepL, sistemas basados en reglas, Apertium, errores de traducción. Abstract : The aim of this dissertation is to do a comparative research analysis on three automatic translation programs available on the internet to determine which is the most advanced for a technical text. First, we will do an analysis on the historic evolution of automatic translation and the several uses given to it, as well as the diverse programs developed for it and how they work.
    [Show full text]
  • The Apertium Bilingual Dictionaries on the Web of Data
    Undefined 0 (0) 1 1 IOS Press The Apertium Bilingual Dictionaries on the Web of Data Jorge Gracia a;∗, Marta Villegas b, Asunción Gómez-Pérez a, and Núria Bel b, a Ontology Engineering Group, Universidad Politécnica de Madrid Campus de Montegancedo s/n Boadilla del Monte 28660 Madrid. Spain E-mail: {jgracia,asun}@fi.upm.es b Institut Universitari Linguistica Aplicada, Universitat Pompeu Fabra Roc Boronat, 138 08018 Barcelona. Spain E-mail: {marta.villegas,nuria.bel}@upf.edu Abstract. Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared trans- lation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats and accessible through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantic enabled resources in a direct manner, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Descrip- tion Framework) and how their data have been made accessible on the Web as linked data. As result, all the converted dictionaries (many of them covering under-resourced languages) are connected among them and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries. Keywords: linguistic linked data, multilingualism, Apertium, bilingual dictionaries, lexicons, lemon, translation 1. Introduction In this article we will focus on the case of electronic bilingual dictionaries as a particular type of languages The publication of bilingual and multilingual lan- resources.
    [Show full text]
  • Apertium: a Free/Open-Source Platform for Machine Translation and Basic Language Technology Mikel L
    Apertium: A free/open-source platform for machine translation and basic language technology www.apertium.org Mikel L. Forcada1 Francis M. Tyers2,3 1Departament de Llenguatges i Sistemes Inform`atics, Universitat d’Alacant, E-03690 Sant Vicent del Raspeig (Spain) 2Department of Linguistics, Indiana University, IN 47405, U.S.A. 3 School of Linguistics, Higher School of Economics, Moscow, Russia [email protected], [email protected] Apertium components Languages and language pairs Since 2005, Apertium provides: 1. A free/open-source, modular, shallow/deep transfer, language-independent Language data is encoded mostly in XML, but some language pairs contain data machine translation engine with: text format management, finite-state lexical encoded in other text-based formats. processing, statistical and constraint-based lexical disambiguation, Stable language pairs (bilingual data) include: discontiguous multi-word assembly and disassembly, anaphora resolution, and shallow structural transfer based on finite-state pattern matching. afr hbs bel crh kaz ind mlt pol urd 2. Free/open-source linguistic data in well-specified XML formats for a variety of languages and language pairs nld bre epo eus cym mkd slv isl rus tur tat zsm ara szl hin 3. Free/open-source tools: compilers to turn linguistic data into an efficient fra oci eng bul swe ukr form used by the engine and software to learn disambiguation or translation spa dan rules from corpora. Interfaces with external technologies: HFST (for some morphologically-rich arg ast por ron sme nno languages), VISL CG-3 (for rule-based lexical disambiguation). cat glg nob Machine translation but not only! ita srd For many of these languages, there are separate monolingual packages.
    [Show full text]
  • A Rule-Based Shallow-Transfer Machine Translation System for Scots and English
    A Rule-based Shallow-transfer Machine Translation System for Scots and English Gavin Abercrombie [email protected] Abstract An open-source rule-based machine translation system is developed for Scots, a low-resourced minor language closely related to English and spoken in Scotland and Ireland. By concentrating on translation for assimilation (gist comprehension) from Scots to English, it is proposed that the development of dictionaries designed to be used within the Apertium platform will be sufficient to produce translations that improve non-Scots speakers understanding of the language. Mono- and bilingual Scots dictionaries are constructed using lexical items gathered from a variety of resources across several domains. Although the primary goal of this project is translation for gisting, the system is evaluated for both assimilation and dissemination (publication-ready translations). A variety of evaluation methods are used, including a cloze test undertaken by human volunteers. While evaluation results are comparable to, and in some cases superior to, those of other language pairs within the Apertium platform, room for improvement is identified in several areas of the system. Keywords: Scots, machine translation, assimilation 1. Introduction (‘what did he do that for’). There are also notable differ- The Scots language is spoken by over 1.5 million people ences to English in negation e.g. huvnae (‘have not’), din- in Scotland1 and a further 140,000 in Ireland,2 but is low- nae (‘do not’), and the double modal e.g. he’ll can dae that resourced in terms of technology - as far as I am aware there (‘he’ll be able to do that’) (Beal, 1997).
    [Show full text]
  • Proceedings of the 21St Annual Conference of the European Association for Machine Translation
    Proceedings of the 21st Annual Conference of the European Association for Machine Translation 28{30 May 2018 Universitat d'Alacant Alacant, Spain Edited by Juan Antonio P´erez-Ortiz Felipe S´anchez-Mart´ınez Miquel Espl`a-Gomis Maja Popovi´c Celia Rico Andr´eMartins Joachim Van den Bogaert Mikel L. Forcada Organised by research group The papers published in this proceedings are |unless indicated otherwise| covered by the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 International (CC-BY-ND 3.0). You may copy, distribute, and transmit the work, provided that you attribute it (au- thorship, proceedings, publisher) in the manner specified by the author(s) or licensor(s), and that you do not use it for commercial purposes. The full text of the licence may be found at https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en. c 2018 The authors ISBN: 978-84-09-01901-4 Rule-based machine translation from Kazakh to Turkish Sevilay Bayatli Sefer Kurnaz Dept. Elec. and Computer Engineering Dept. Elec. and Computer Engineering Altınbaş Üniversitesi Altınbaş Üniversitesi [email protected] [email protected] Ilnar Salimzianov Jonathan North Washington Francis M. Tyers School of Science and Technology Linguistics Deptartment School of Linguistics Nazarbayev University Swarthmore College Higher School of Economics [email protected] [email protected] [email protected] Abstract all the dictionary entries and transfer rules. Con- versely, in a corpus-based/statistical MT approach This paper presents a shallow-transfer ma- (Koehn, 2010), no such effort is required as the sys- chine translation (MT) system for translat- tem can be automatically built from parallel corpora ing from Kazakh to Turkish.
    [Show full text]
  • Apertium: a Rule-Based Machine Translation Platform
    Apertium: A rule-based machine translation platform Francis M. Tyers HSL-fakultehta UiT Norgga árktalaš universitehta 9018 Romsa (Norway) 28th September 2014 1 / 41 Outline Introduction Design Development Status Teaching Courses Google Summer of Code Google Code-in Research New language pairs Applying unsupervised methods Hybrid systems Future work and challenges Challenges Plans Collaboration 1 / 41 History I 2004 — Spain gets a new government which launches a call for proposals to build machine translation systems for the languages of Spain I The Universitat d’Alacant (UA), in consortium with EHU, UPC, UVigo, Eleka, Elhuyar (Basque Country) and Imaxin (Galicia) get funded to develop two MT systems: I Apertium: Spanish–Catalan, Spanish–Galician I Matxin: Spanish!Basque I Apertium was not built from scratch, but was rather a rewrite of two existing closed-source systems which had been built by UA 2 / 41 Focus I Marginalised: Languages which are on the periphery either societally or technologically (from Breton to Bengali). I Lesser-resourced: Languages for which few free/open-source language resources exist. I Closely-related: Languages which are suited to shallow-transfer machine translation. 3 / 41 Translation philosophy I Build on word-for-word translation I Avoid complex linguistic formalisms I We like saying “only secondary-school linguistics required”1 I Intermediate representation based on morphological information I Transformer-style systems I Analyse the source language (SL), then I Apply rules to ‘transform’ the SL to the TL 1Otte, P. and Tyers, F. (2011) “Rapid rule-based machine translation between Dutch and Afrikaans”. Proceedings of the 16th Annual Conference of the European Association of Machine Translation 4 / 41 Pipeline SL deformatter text morph.
    [Show full text]
  • The Apertium Bilingual Dictionaries on the Web of Data
    Semantic Web 0 (0) 1 1 IOS Press The Apertium Bilingual Dictionaries on the Web of Data Editor(s): Philipp Cimiano, Universität Bielefeld, Germany Solicited review(s): Roberto Navigli, Sapienza University of Rome, Italy; Bettina Klimek, University of Leipzig, Germany; Sebastian Hellmann, University of Leipzig, Gemany; one anonymous reviewer Jorge Gracia a;∗, Marta Villegas b, Asunción Gómez-Pérez a, and Núria Bel b, a Ontology Engineering Group, Universidad Politécnica de Madrid Campus de Montegancedo s/n Boadilla del Monte 28660 Madrid. Spain E-mail: {jgracia,asun}@fi.upm.es b Institut Universitari Linguistica Aplicada, Universitat Pompeu Fabra Roc Boronat, 138 08018 Barcelona. Spain E-mail: {marta.villegas,nuria.bel}@upf.edu Abstract. Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared trans- lation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats and accessible through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantic enabled resources in a direct manner, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource De- scription Framework) and how their data have been made accessible on the Web as linked data. As a result, all the converted dictionaries (many of them covering under-resourced languages) are connected among them and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries.
    [Show full text]
  • New Inflectional Lexicons and Training Corpora for Improved
    New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian Nikola Ljubesiˇ c´∗ Filip Klubickaˇ ∗ Zeljkoˇ Agic´y Ivo-Pavao Jazbecz * Dept. of Information and Communication Sciences, University of Zagreb Ivana Luciˇ ca´ 3, HR-10000 Zagreb, Croatia y Center for Language Technology, University of Copenhagen Njalsgade 140, 2300 Copenhagen S, Denmark z Institute of Croatian Language and Linguistics Ulica Republike Austrije 16, HR-10000 Zagreb, Croatia fnljubesi,[email protected] [email protected] [email protected] Abstract In this paper we present newly developed inflectional lexcions and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex—two freely available inflectional lexicons of Croatian and Serbian—and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manually annotated corpus of Croatian, 500 thousand tokens in size. We showcase the three newly developed resources on the task of morphosyntactic annotation of both languages by using a recently developed CRF tagger. We achieve best results yet reported on the task for both languages, beating the HunPos baseline trained on the same datasets by a wide margin. Keywords: inflectional lexicon, morphosyntactic annotation, Croatian, Serbian 1. Introduction expansion or enrichment. Snajderˇ et al. (2008) provide an- other line of work on Croatian inflectional lexica, but the In this paper we introduce hrLex and srLex, freely available resulting resource is not freely available. morphological lexicons of Croatian and Serbian. We de- For Serbian the SrpMD dictionary (Krstev, 2008), 85,721 scribe the process of building these lexicons which includes lemmas in size, is published under a non-commercial li- supervised machine learning techniques for lemma and cense and indexed on Meta-Share, but is not available for paradigm candidate ranking of non-covered words to en- download.
    [Show full text]