Developing Prototypes for Machine Translation Between Two Sámi Languages

Total Page:16

File Type:pdf, Size:1020Kb

Developing Prototypes for Machine Translation Between Two Sámi Languages Developing Prototypes for Machine Translation between Two Sámi Languages Francis M. Tyers Linda Wiechetek Trond Trosterud Departament de Llenguatges Giellatekno, Giellatekno, i Sistemes Informàtics, Universitetet i Tromsø, Universitetet i Tromsø, Universitat d’Alacant Norway Norway E-03071 Alacant (Spain) [email protected] [email protected] [email protected] Abstract north of Norway and Sweden, North Sámi also This paper describes the development of in Finland. North Sámi has between 15,000 and two prototype systems for machine trans- 25,000 speakers, while Lule Sámi has less than lation between North Sámi and Lule Sámi. 2,000 speakers. Experiments were conducted in rule-based The Sámi proto-language was originally an ag- machine translation (RBMT), using the glutinative language, but North and Lule Sámi Apertium platform, and statistical ma- have developed features known from inflective lan- chine translation (SMT) using the Moses- guages (case/number combinations are often ex- decoder. The experiments show that both pressed by one suffix only, certain morphologi- approaches have their advantages and dis- cal distinctions are expressed by means of conso- advantages, and that they can both make nant gradation (i.e. a non-segmental process) only, use of pre-existing linguistic resources. etc.). The main objective with the development of the 1 Introduction prototype rule-based system was to evaluate how In this paper we describe the development of two well existing resources could be re-used, and if the prototype machine translation systems between shallow-transfer approach was suited to languages two Sámi languages, North Sámi (sme) and Lule with more agglutinative typologies. Sámi (smj), one rule-based (Apertium), and one statistical (Moses). There are other systems which 1.2 A typology of MT systems for minority have been developed with marginalised languages languages in mind (e.g. (Lavie, 2008)), but, as of writing, Minority language speakers typically differ from these were not available under an open-source li- the majority in being bilingual, the minority speaks cence and thus could not be applied to the task at the language of the majority, but not vice-versa. hand. The content will be split into several sec- This has some implications for the requirements tions. The first section will give a general overview society will put to machine translation systems. of the languages in question, and sketch a typol- A majority to minority language system must be ogy of MT scenarios for minority languages. The of high quality, so high that post-editing the output next sections will describe the two machine trans- is faster than translating from scratch. The goal is lation strategies in some detail and will outline to produce well-formed text, not to understand the how the existing language technology was able to content, since the minority language users will pre- be re-used and integrated. We will follow this by fer the original to a bad translation. A minority to a short evaluation and then some discussion and majority language system, on the other hand, will future work. be useful even as a gist system, answering vital questions such as “what are they writing about me 1.1 The languages in the minority language newspaper?”. The sys- Both North Sámi and Lule Sámi belong to the tems presented here are minority to minority lan- Finno-Ugric language family and are spoken in the guage systems. North and Lule Sámi are mutu- c 2009 European Association for Machine Translation. ally intelligible, and also in this context a gist sys- Proceedings of the 13th Annual Conference of the EAMT, pages 120–127, Barcelona, May 2009 120 <pardef n="gu/olli__N"> tem will not be that interesting. The importance of <e> the system lies in its ability to produce text. Here, <p> North Sámi is the larger language, possessing close <l>liid</l> <r>olli<s n="N"/><s n="Pl"/> to a full curriculum of school textbooks. A high- <s n="Acc"/></r> quality MT system would help produce the same </p> for Lule Sámi, and moreover from the closely re- </e> ... lated North Sámi than from Norwegian. The same </pardef> situation may found for many language communi- ties. Figure 1: Section of inflectional paradigm for gu/olli__N 2 Rule-based machine translation 2.1 Apertium tween <l></l> liid the generated ending and the Apertium is an open-source platform for creating items between <r></r> the analysis including the rule-based machine translation systems. It was ini- lemma and the morphological tags. tially designed for closely-related languages, but The Divvun and Giellatekno Sámi language has also been adapted to work better for less- technology projects2 use finite-state transducers related languages. The engine largely follows a for the morphological analyser and closed- shallow-transfer approach. Finite-state transducers source finite-state tools from Xerox (Beesley and (Garrido-Alenda and Forcada, 2002) and (Roche Karttunen, 2003). They tools handle two-level and Schabes, 1997) are used for lexical processing, morphology model with twolc (two-level com- first-order hidden Markov models (HMM) are used piler) for morphophonological analysis together for part-of-speech tagging, and multi-stage finite- with lexical tools in a single transducer, con- state based chunking for structural transfer (For- sonant gradation, diphthong simplification and cada, 2006). The original shallow-transfer Aper- compounding are handled by two-level rules. tium system consists of a de-formatter, a morpho- Consonant gradation and diphthong simplification logical analyser, the categorial disambiguator, the of the noun guolli are handled in the following structural and lexical transfer module, the morpho- way. guolli is listed in the root lexicon with lemma logical generator, the post generator and the re- and continuation lexicon AIGI and redirected to formatter. the sublexicon AIGI. 2.1.1 Analysis and generation For the analysis and generation, we used exist- guolli AIGI "fish N" ; ing finite-state transducers for the two languages.1 LEXICON AIGI !Bisyll. V-Nouns. lttoolbox has been widely used to model ro- +N+Sg+Acc:%>X4 K ; +N:%>X5 GODII- ; ! weak gr dipth simpl mance language morphology, and although it has ... been used to model the morphology of other lan- guages with complex morphology (e.g. Basque), it From there it is redirected to a further sublex- is not ideal for these languages. lttoolbox is lack- icon GODII- which redirects it to the sublexicon ing features for dealing with stem-internal varia- GODII-, which provides the plural accusative anal- tion, diphthong simplification, and compounding. ysis. North Sámi word forms involve both conso- LEXICON GODII- nant gradation, diphthong simplification and com- +Pl+Acc:jd9 K ; pounding. The North Sámi noun guolli (‘fish’) al- At the same time, a two-level rule handles diph- ternates between -ll- (strong stage) and -l- (weak thong simplification when encountering the dia- stage). critical mark X5 by removing the second vowel (e Additionally, one has to deal with diphthong o a) in a diphthong (ie uo ea) if the suffix contains simplification, the diphthong uo changes into a an i. monophtong u in e.g. accusative plural guliid. Vx:0 <=> Vow _ Cns:+ i (...) X5: ; In the Apertium lexicon, guolli is represented as where Vx in (e o a) ; in figure 1. gu represents the stem, the item be- 2To be found on http://www.divvun.no/index. 1http://giellatekno.uit.no/ html and http://giellatekno.uit.no/. 121 Consonant gradation is handled in another rule ij ij+V+Neg+Prs+Sg3 where a consonant (f l m n r s . ) is removed ittjij ij+V+Neg+Prt+Sg3 between a vowel, an identical consonant, another Both for generation and analysis that means that vowel, and a weak grade triggering diacritial mark one has to find a possibility to account for the (the rule is slightly simplified, noted by . ). ‘missing’ tag in North Sámi. ‘Missing’ means here Cx:0 <=> Vow: _ Cy Vow (...) WeG: ; the lack of tag specification for the temps (tempus) where Cx in (f l m n r s ...) variable in the transfer files. Cy in (f l m n r s ...) A number of multiword expressions differ from A general difficulty for generation and analysis each other in SL and TL. While in North Sámi are inconsistent tagsets in SL and TL. While verbs gii beare (‘whoever’) has inner inflection, the Lule are specified with regard to transitivity (V TV, V Sámi vajku guhti does not. The initial pronoun gii IV) for North Sámi, they were not specified in the corresponds to the second component guhti in Lule Lule Sámi dictionary (only V). Another matter of Sámi. choice and convenience is the degree of lexicali- The last type of generation modification hap- sation as in the case of derived verbs. The North pens in a separate step. Orthographic variants and Sámi verbform gohˇcoduvvo (‘he/she is called’) ei- contractions are handled by the postgenerator. The ther goes back to the form gohˇcˇcut (‘order’) or to Lule Sámi copula liehket (’to be’) has three forms gohˇcodit (‘call, name’), which is derived from go- for the tag combination liehket+V+Ind+Prs+Sg3, hˇcˇcut but to some extent lexicalised. le, la, l. While le and la are interchangable vari- gohcˇcut+V+TV+Der1+Der/d+Vˇ ants, l is a shortened form of la after wordforms +Der2+Der/PassL+V+Ind+Prs+Sg3 that end in a vowel. The postgeneration lexicon gohcodit+V+TV+Der2+Der/PassL+V+Ind+Prs+Sg3ˇ specifies this change and outputs the correct form. gåhtjudit+V+TV+Der1+Der/Pass+V+Ind+Prs+Sg3 2.1.2 Disambiguation (Constraint Grammar) In Lule Sámi, gåhtjuduvvá only gets the analy- Disambiguation of morphological and shallow sis with the lexicalised verb gåhtjudit as a lemma.
Recommended publications
  • Annotation Schemes in North Sami Dependency Parsing
    Annotation schemes in North Sámi dependency parsing Mariya Sheyanova Fundamental and Applied Linguistics Higher School of Economics Moscow, Russia [email protected] Francis M. Tyers Giela ja kultuvrra instituhtta UiT Norgga árktalaš universitehta N-9018 Romsa, Norway [email protected] Abstract In this paper we describe a comparison of two annotation schemes for de- pendency parsing of North Sámi, a Finno-Ugric language spoken in the north of Scandinavia and Finland. The two annotation schemes are the Giellatekno (GT) scheme which has been used in research and applications for the Sámi languages and Universal Dependencies (UD) which is a cross-lingual scheme aiming to unify annotation stations across languages. We show that we are able to deterministi- cally convert from the Giellatekno scheme to the Universal Dependencies scheme without a loss of parsing performance. While we do not claim that either scheme is a priori a more adequate model of North Sámi syntax, we do argue that the choice of annotation scheme is dependent on the intended application. This work is licensed under a Creative Commons Attribution–NoDerivatives 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by-nd/4.0/ 1 66 The 3rd International Workshop for Computational Linguistics of Uralic Languages, pages 66–75, St. Petersburg, Russia, 23–24 January 2017. c 2017 Association for Computational Linguistics 1 Introduction Dependency parsing is an important step in many applications of natural language processing, such as information extraction, machine translation, interactive language learning and corpus search interfaces. There are a number of approaches to depen- dency parsing.
    [Show full text]
  • "Machine Translation Evaluation Through Post-Editing…"
    ©inTRAlinea & Anna Fernández Torné (2016). "Machine translation evaluation through post-editing measures in audio description", inTRAlinea Vol. 18. Permanent URL: http://www.intralinea.org/archive/article/2200 inTRAlinea [ISSN 1827-000X] is the online translation journal of the Department of Interpreting and Translation (DIT) of the University of Bologna, Italy. This printout was generated directly from the online version of this article and can be freely distributed under the following Creative Commons License. Machine translation evaluation through post-editing measures in audio description By Anna Fernández Torné (Universitat Autònoma de Barcelona, Spain) Abstract & Keywords English: The number of accessible audiovisual products and the pace at which audiovisual content is made accessible need to be increased, reducing costs whenever possible. The implementation of different technologies which are already available in the translation field, specifically machine translation technologies, could help reach this goal in audio description for the blind and partially sighted. Measuring machine translation quality is essential when selecting the most appropriate machine translation engine to be implemented in the audio description field for the English-Catalan language combination. Automatic metrics and human assessments are often used for this purpose in any specific domain and language pair. This article proposes a methodology based on both objective and subjective measures for the evaluation of five different and free online machine translation systems. Their raw machine translation outputs and the post-editing effort that is involved are assessed using eight different scores. Results show that there are clear quality differences among the systems assessed and that one of them is the best rated in six out of the eight evaluation measures used.
    [Show full text]
  • A Finite-State Morphological Analyser for Sindhi
    A Finite-State Morphological Analyser for Sindhi Raveesh Motlani1, Francis M. Tyers2 and Dipti M. Sharma1 1FC Kohli Center on Intelligent Systems (KCIS), International Institute of Information Technology Hyderabad, Telangana, India 2 HSL-fakultehta, UiT Norgga árktalaš universitehta N-9019 Romsa [email protected], [email protected], [email protected] Abstract Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of free/open-source finite-state morphological analyser for Sindhi. We have used Apertium’s lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia, which is a freely-available large corpus of Sindhi and achieved a reasonable coverage of about 81% and a precision of over 97%. Keywords: Sindhi, Morphological Analysis, Finite-State Machines [ɓ] ٻ [ɲ] ڃ [ŋ] ڱ Introduction .1 [ɠ] [ʄ] [bʱ] ڀ ڄ ڳ Morphology describes the internal structure of words in a [dʱ] ڌ [cʰ] ڇ [k] ڪ language. A morphological analysis of a word involves [ɗ] ڏ [ʈʰ] ٺ [ɳ] ڻ ,describing one or more of its properties such as: gender [ɖ] ڊ [ʈ] ٽ [pʰ] ڦ number, person, case, lexical category, etc. Morphological [ɖʱ] ڍ [tʰ] ٿ [ɽ] ڙ -analysis of a word thus becomes a fundamental and cru cial task in natural-language processing for any language.
    [Show full text]
  • Multi-Script Morphological Transducers and Transcribers for Seven Turkic
    Multi-script morphological transducers and transcribers for seven Turkic languages Jonathan Washington, Francis Tyers, Oğuzhan Kuyrukçu Swarthmore College, Indiana University / Высшая Школа Экономики, Boğaziçi Üniversitesi [email protected], [email protected], [email protected] This paper describes ongoing work to augment morphological transducers for seven Turkic languages with support for multiple scripts each, as well as respective IPA transcription systems. Evaluation demonstrates that our approach yields coverage equivalent to or not much lower than that of the base transducers. Background. A morphological transducer converts between form and analysis, e.g. алмалардан ↔ алма<n><pl> <abl>, where the form-to-analysis task is termed “morphological analysis” and the analysis-to-form task is termed “morphological generation”. Existing Free/Open-Source morphological transducers for Turkic languages (Wash- ington et al. 2020) are implemented in only one orthography, despite a number of the languages being currently written in two or more orthographies, or having a large body of text written in an orthography that was recently switched away from. This paper builds on work in which Cyrillic support was added to a transducer for Crimean Tatar which had been implemented in the Latin script (Tyers et al. 2019). We leverage morphological transducers for Kazakh (imple- mented in the Cyrillic script), Kyrgyz (Cyrillic), Turkmen (Latin), Qaraqalpaq (Latin), Uzbek (Latin), and Uyghur (Perso-Arabic), and add support for analysis and generation in additional scripts that are currently or have recently been used for the languages. Specifically, we add Cyrillic support to Turkmen, Qaraqalpaq, Uzbek, and Uyghur transducers; Perso-Arabic support to the Kazakh and Kyrgyz transducers; Latin script to the Kazakh and Uyghur transducers; and IPA support to all of them.
    [Show full text]
  • Developing Morph-Analyzer for Urdu Using Apertium: Some Issues
    Developing Morph-Analyzer for Urdu Using Apertium: Some issues Shahid Mushataq Bhat [email protected] Linguistic Data Consortium for Indian Languages (LDCIL) CIIL, Mysore Content: Introduction Morphological features of Urdu Apertium (LT-toolbox): Some background Computing Noun-morphology using LT Toolbox Split-Orthography of Urdu Conclusion Introduction: Automatic morphological analysis is the fundamental task in NLP that can be employed in enhancing the accuracy of POS-taggers, Chunkers, Parsers and Information retrieval systems. Computational morphology models the internal structure of words i-e the way; words are built out of minimal units called morphemes. Most of natural languages construct words by concatenating morphemes together in strict orders. Such Concatenative morphotactics is highly productive, particularly, in agglutinative languages like Tamil, Kannada, Manipuri, etc but in some languages like Hebrew and Arabic (Semitic languages) infixation is the main morphological operation (instead of concatenation), constituting Non-Concatenative (Templatic or Root & Pattern) morphology. Continues …… Beyond this Concatenative and Non-Concatenative polarity, Urdu nouns (unlike nouns of other Indian Languages) show the interplay of both types of morphologies. So, morphological structure of Urdu like Tagalog (a language of Philippines) can’t be computed adequately unless dual nature of its morphology is not taken into account. “The morphotactic limitations of the traditional implementations are the direct result of relying solely
    [Show full text]
  • Estudio Comparativo De Tres
    TRABAJO DE FIN DE GRADO FACULTAD DE CIENCIAS HUMANAS Y SOCIALES GRADO EN TRADUCCIÓN E INTERPRETACIÓN ESTUDIO COMPARATIVO DE TRES TRADUCTORES AUTOMÁTICOS EN LÍNEA : DEEP L, YANDEX Y APERTIUM Autora: Mónica Adán Soriano Directora: Profesora Mª Luisa Romana García Madrid, junio 2019 Resumen : Este trabajo tiene la finalidad de comparar traductores automáticos en línea para así determinar cuál es el traductor más avanzado para un texto técnico. Para ello, primero, habrá un análisis de la evolución histórica de la traducción automática y sus usos, así como los programas desarrollados para ello. Se explicarán además los diferentes sistemas de traducción automática que existen y cómo funcionan. En la parte experimental, se escogerá un texto de carácter técnico en español y de oraciones que supongan un reto para un traductor y se procesará en los tres traductores automáticos escogidos según su modalidad para realizar una traducción al inglés. Una vez obtenidas las respuestas, se analizarán los errores cometidos por los traductores automáticos, concluyendo así con los errores más comunes y el mejor traductor automático en línea así como los usos que se le pueden dar. Palabras clave: traducción automática, sistemas basados en estadística, Yandex, sistemas neuronales, DeepL, sistemas basados en reglas, Apertium, errores de traducción. Abstract : The aim of this dissertation is to do a comparative research analysis on three automatic translation programs available on the internet to determine which is the most advanced for a technical text. First, we will do an analysis on the historic evolution of automatic translation and the several uses given to it, as well as the diverse programs developed for it and how they work.
    [Show full text]
  • The Apertium Bilingual Dictionaries on the Web of Data
    Undefined 0 (0) 1 1 IOS Press The Apertium Bilingual Dictionaries on the Web of Data Jorge Gracia a;∗, Marta Villegas b, Asunción Gómez-Pérez a, and Núria Bel b, a Ontology Engineering Group, Universidad Politécnica de Madrid Campus de Montegancedo s/n Boadilla del Monte 28660 Madrid. Spain E-mail: {jgracia,asun}@fi.upm.es b Institut Universitari Linguistica Aplicada, Universitat Pompeu Fabra Roc Boronat, 138 08018 Barcelona. Spain E-mail: {marta.villegas,nuria.bel}@upf.edu Abstract. Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared trans- lation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats and accessible through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantic enabled resources in a direct manner, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Descrip- tion Framework) and how their data have been made accessible on the Web as linked data. As result, all the converted dictionaries (many of them covering under-resourced languages) are connected among them and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries. Keywords: linguistic linked data, multilingualism, Apertium, bilingual dictionaries, lexicons, lemon, translation 1. Introduction In this article we will focus on the case of electronic bilingual dictionaries as a particular type of languages The publication of bilingual and multilingual lan- resources.
    [Show full text]
  • Apertium: a Free/Open-Source Platform for Machine Translation and Basic Language Technology Mikel L
    Apertium: A free/open-source platform for machine translation and basic language technology www.apertium.org Mikel L. Forcada1 Francis M. Tyers2,3 1Departament de Llenguatges i Sistemes Inform`atics, Universitat d’Alacant, E-03690 Sant Vicent del Raspeig (Spain) 2Department of Linguistics, Indiana University, IN 47405, U.S.A. 3 School of Linguistics, Higher School of Economics, Moscow, Russia [email protected], [email protected] Apertium components Languages and language pairs Since 2005, Apertium provides: 1. A free/open-source, modular, shallow/deep transfer, language-independent Language data is encoded mostly in XML, but some language pairs contain data machine translation engine with: text format management, finite-state lexical encoded in other text-based formats. processing, statistical and constraint-based lexical disambiguation, Stable language pairs (bilingual data) include: discontiguous multi-word assembly and disassembly, anaphora resolution, and shallow structural transfer based on finite-state pattern matching. afr hbs bel crh kaz ind mlt pol urd 2. Free/open-source linguistic data in well-specified XML formats for a variety of languages and language pairs nld bre epo eus cym mkd slv isl rus tur tat zsm ara szl hin 3. Free/open-source tools: compilers to turn linguistic data into an efficient fra oci eng bul swe ukr form used by the engine and software to learn disambiguation or translation spa dan rules from corpora. Interfaces with external technologies: HFST (for some morphologically-rich arg ast por ron sme nno languages), VISL CG-3 (for rule-based lexical disambiguation). cat glg nob Machine translation but not only! ita srd For many of these languages, there are separate monolingual packages.
    [Show full text]
  • Word Classes and Part-Of-Speech 5 Tagging
    Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing. Daniel Jurafsky & James H. Martin. Copyright c 2006, All rights reserved. Draft of July 30, 2007. Do not cite without permission. WORD CLASSES AND PART-OF-SPEECH 5 TAGGING Conjunction Junction, what’s your function? Bob Dorough, Schoolhouse Rock, 1973 A gnostic was seated before a grammarian. The grammarian said, ‘A word must be one of three things: either it is a noun, a verb, or a particle.’ The gnostic tore his robe and cried, “Alas! Twenty years of my life and striving and seeking have gone to the winds, for I laboured greatly in the hope that there was another word outside of this. Now you have destroyed my hope.’ Though the gnostic had already attained the word which was his purpose, he spoke thus in order to arouse the grammarian. Rumi (1207–1273), The Discourses of Rumi, Translated by A. J. Arberry Dionysius Thrax of Alexandria (c. 100 B.C.), or perhaps someone else (exact au- thorship being understandably difficult to be sure of with texts of this vintage), wrote a grammatical sketch of Greek (a “techne¯”) which summarized the linguistic knowl- edge of his day. This work is the direct source of an astonishing proportion of our modern linguistic vocabulary, including among many other words, syntax, diphthong, PARTS­OF­SPEECH clitic, and analogy. Also included are a description of eight parts-of-speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article. Although ear- lier scholars (including Aristotle as well as the Stoics) had their own lists of parts-of- speech, it was Thrax’s set of eight which became the basis for practically all subsequent part-of-speech descriptions of Greek, Latin, and most European languages for the next 2000 years.
    [Show full text]
  • A Rule-Based Shallow-Transfer Machine Translation System for Scots and English
    A Rule-based Shallow-transfer Machine Translation System for Scots and English Gavin Abercrombie [email protected] Abstract An open-source rule-based machine translation system is developed for Scots, a low-resourced minor language closely related to English and spoken in Scotland and Ireland. By concentrating on translation for assimilation (gist comprehension) from Scots to English, it is proposed that the development of dictionaries designed to be used within the Apertium platform will be sufficient to produce translations that improve non-Scots speakers understanding of the language. Mono- and bilingual Scots dictionaries are constructed using lexical items gathered from a variety of resources across several domains. Although the primary goal of this project is translation for gisting, the system is evaluated for both assimilation and dissemination (publication-ready translations). A variety of evaluation methods are used, including a cloze test undertaken by human volunteers. While evaluation results are comparable to, and in some cases superior to, those of other language pairs within the Apertium platform, room for improvement is identified in several areas of the system. Keywords: Scots, machine translation, assimilation 1. Introduction (‘what did he do that for’). There are also notable differ- The Scots language is spoken by over 1.5 million people ences to English in negation e.g. huvnae (‘have not’), din- in Scotland1 and a further 140,000 in Ireland,2 but is low- nae (‘do not’), and the double modal e.g. he’ll can dae that resourced in terms of technology - as far as I am aware there (‘he’ll be able to do that’) (Beal, 1997).
    [Show full text]
  • Apertium-Icenlp: a Rule-Based Icelandic to English Machine Translation System
    Apertium-IceNLP: A rule-based Icelandic to English machine translation system Martha Dís Brandt, Hrafn Loftsson, Francis M. Tyers Hlynur Sigurþórsson School of Computer Science Dept. Lleng. i. Sist. Inform. Reykjavik University Universitat d’Alacant IS-101 Reykjavik, Iceland E-03071 Alacant, Spain {marthab08,hrafn,hlynurs06}@ru.is [email protected] Abstract translate between the source language (SL) and the target language (TL). We describe the development of a pro- SMT has many advantages, e.g. it is data-driven, totype of an open source rule-based language independent, does not need linguistic ex- Icelandic English MT system, based on perts, and prototypes of new systems can by built → the Apertium MT framework and IceNLP, quickly and at a low cost. On the other hand, the a natural language processing toolkit for need for parallel corpora as training data in SMT Icelandic. Our system, Apertium-IceNLP, is also its main disadvantage, because such corpora is the first system in which the whole are not available for a myriad of languages, espe- morphological and tagging component of cially the so-called less-resourced languages, i.e. Apertium is replaced by modules from languages for which few, if any, natural language an external system. Evaluation shows processing (NLP) resources are available. When that the word error rate and the position- there is a lack of parallel corpora, other machine independent word error rate for our pro- translation (MT) methods, such as rule-based MT, totype is 50.6% and 40.8%, respectively. e.g. Apertium (Forcada et al., 2009), may be used As expected, this is higher than the corre- to create MT systems.
    [Show full text]
  • Analysing Constraint Grammar with SAT I L
    T D L P Analysing Constraint Grammar with SAT I L Department of Computer Science and Engineering C U T U G Gothenburg, Sweden 2016 Analysing Constraint Grammar with SAT I L © Inari Listenmaa, 2016. Technical Report 154L ISSN 1652-876X Research groups: Functional Programming / Language Technology Department of Computer Science and Engineering C U T U G SE-412 96 Gothenburg Sweden Telephone +46 (0)31-772 1000 Typeset in Palatino and Monaco by the author using X TE EX Printed at Chalmers Gothenburg, Sweden 2016 Abstract Constraint Grammar (CG) is a robust and language-independent formalism for part-of-speech tagging and shallow parsing. A grammar consists of disambiguation rules for initially am- biguous, morphologically analysed text: the correct analysis for the sentence is attained by removing improper readings from ambiguous words. Wide-coverage Constraint Grammars, containing some thousands of rules, generally achieve very high accuracy for free text: thanks to the flexible formalism, new rules can be written to address even the rarest phenomena, without compromising the general tendencies. Even with a smaller set of rules, CG-based taggers have been competetive with statistical taggers. In addition, CG has been used in more experimental settings: dependency syntax, dialogue system for language learning and anaphora resolution. This thesis presents two contributions to the study of CG. Firstly, we model a parallel variant of CG as a Boolean satisfiability (SAT) problem, and implement a CG-engine by using a SAT-solver. This is attractive for several reasons: formal logic is well-studied, and serves as an abstract language to reason about the properties of CG.
    [Show full text]