The FISKMÖ Project: Resources and Tools for Finnish-Swedish Machine

Total Page:16

File Type:pdf, Size:1020Kb

The FISKMÖ Project: Resources and Tools for Finnish-Swedish Machine Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3808–3815 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC The FISKMO¨ Project: Resources and Tools for Finnish-Swedish Machine Translation and Cross-Linguistic Research Jorg¨ Tiedemann∗, Tommi Nieminen, Mikko Aulamo∗, Jenna Kanervay, Akseli Leinoy, Filip Gintery, Niko Papula ∗Department of Digital Humanities, yDepartment of Future Technologies, Multilizer University of Helsinki, University of Turku, Espoo fjorg.tiedemann, mikko.aulamog@helsinki.fi, fjmnybl,akeele,figintg@utu.fi, [email protected] Abstract This paper presents FISKMO,¨ a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages for the general purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services making it possible to work with highly sensitive data without compromising security concerns. Keywords: parallel corpora, machine translation 1. Introduction and for the optimisation of domain-specific translation en- gines. Finland is a country with two official languages, Finnish This paper reports the achievements of the project includ- and Swedish. The demand for translation between the two ing the data sets we have collected so far, machine trans- languages is huge and translation efforts produce a lot of lation models we have developed and translation tools that overhead and costs in public administration and organisa- we have released. In particular, we describe tions as well as any service providers in the country. Yet there is no systematic collection of translated data nor any • data provided by governmental organisations and aca- large scale effort to develop translation services and tools demic institutions, for the public and administrative use in the country. For • methods for extracting parallel segments from web- this reason, we started FISKMO¨ 1 (finsk-svensk korpus och crawled data, maskinoversattning)¨ a joint project of the universities of Helsinki and Turku and Kites (an umbrella organization • tools for pre-processing and alignment of parallel data, for the language sector in Finland), currently funded by the • the implementation of a general-purpose on-line trans- Swedish Culture Foundation.2 lation service and The primary goal of the project is to create a massive par- allel corpus of Finnish and Swedish to support linguistic • the development of computer-assisted translation research and machine translation (MT) development. The (CAT) tools with fully integrated machine translation. secondary goal of the project is to establish a public ma- The latter is especially interesting for real-world applica- chine translation service for translation between Finnish tions in which translation efforts need to be done off-line and Swedish. As the quantity and quality of training data and with strict security constraints. We released a plugin is the most important factor in machine translation quality, to the popular SDL Trados Studio CAT tool that integrates the FISKMO¨ translation service can be expected to pro- the MT engine inside the plugin itself making it obsolete to vide translations of higher quality than other public ma- send (potentially sensitive) data to on-line services. More chine translation services, for which less training data is details about this tool will be provided in Section 4.2. available. Our mission is to systematically collect material that has 2. Data Collection been translated from Finnish to Swedish and vice versa in The primary goal of FISKMO¨ is the systematic collection the public sector, in private organisations and at language of Finnish-Swedish parallel data from public as well as pri- service providers in Finland. We intend to publish the data vate sources in order to improve the development of ma- with permissive licenses to increase the impact on on-going chine translation between the two languages. Furthermore, research and development. Nevertheless, we will also in- our project intends to support cross-linguistic research by clude restricted data sets that can be used internally for im- making those data sets available to researchers in transla- provements of the coverage of machine translation models tion studies and general linguistics. Computer-aided lan- guage learning is another area where translation data is be- 1https://blogs.helsinki.fi/fiskmo-project/ coming increasingly useful. Our efforts in collecting data 2https://www.kulturfonden.fi relate to two strategies: 3808 Figure 1: The basic workflow for data collection and machine translation development in the FISKMO¨ project. 1. Collecting data from public and private organisations tools in the workflow of professional translators. The ap- and language service providers. We also ask for gen- pearance of the ”EU Presidency Translator”3 developed in eral data donations from the public. collaboration with the Finnish Prime Minister’s Office and their translation unit shed some new light on the integration 2. Web crawling and parallel segment extraction. We de- of MT in human translation demonstrating the benefits of velop and apply methods for the identification of trans- such tools in the translator’s daily work. lated sentences and documents from unrestricted web The private sector has been an even more challenging seg- crawls. ment despite initial promises to donate data. Finnish lan- guage service providers (LSPs) are generally not using MT The workflow for data collection and its connection to ma- in large scale and thus they are not very experienced in that chine translation development is illustrated in Figure 1. Be- technology. It seems that there is a common believe that low, we give more details about the efforts and procedures the development of machine translation would benefit their that we apply. competitors more than themselves and, therefore, it is bet- 2.1. Collecting data from organizations ter to hold back with handing out data. Another issue is also that language service providers usually do not own the So far the majority of the data received through direct con- translations they have done for their clients. It will be nec- tributions has come from public organizations. Public orga- essary to change the way IPRs are regulated between LSPs nizations in Finland are obliged to give their data for public and their customers but permissions could easily be asked use if there are no good grounds to do otherwise. How- from the clients to improve services that they order from ever, collecting the data is not always straightforward as LSPs. In general , convincing LSPs to see the benefits of their publication can differ greatly. Mostly, organisations translation services that they can incorporate in their own refer to their websites to point to their public data sets. workflow requires the successful demonstration of practi- Unfortunately, mining parallel data from the web comes cal tools. Therefore, it is crucial to develop proper inte- with various caveats as website structures and functional- grations into CAT tools similar to the one we introduce in ities vary. Hence, our goal in FISKMO¨ is to obtain the Sections 4.2.. data directly from the original source in suitable electronic To help potential providers to donate their data, we have formats and with explicit correspondences between trans- created different procedures they can follow to support our lations and their original documents. Furthermore, we ex- efforts. pected translation memories to be in wide use internally to make it easy to incorporate collected translation knowledge Donating translation memories: In this method a trans- in our project. lation memory is donated as is. The translation mem- Unfortunately, the situation is more complicated than ex- ory can be an existing translation memory but some- pected for numerous reasons. In some cases, translation times organizations prefer to create a specific one for units work independently from the publishing unit, which this purpose to better control its contents. In return we makes it difficult to identify translation data that is pub- offer automatically cleaned translation memories and lished openly without further editing and modification. Ad- customized MT engines based on the data if there is ditionally, public data and sensitive data are often not sys- sufficient training material and resources available. tematically separated and translation memories incorporate mixed content. Furthermore, copyrights and intellectual Crawling and filtering with translation memories: property rights (IPRs) may not always be clear for all data This method effectively overcomes many privacy files in larger collections. Those reasons and the fear of concerns. We crawl e.g. organization websites and compromising privacy information make it difficult to con- thus create a monolingual corpus
Recommended publications
  • Translators' Tool
    The Translator’s Tool Box A Computer Primer for Translators by Jost Zetzsche Version 9, December 2010 Copyright © 2010 International Writers’ Group, LLC. All rights reserved. This document, or any part thereof, may not be reproduced or transmitted electronically or by any other means without the prior written permission of International Writers’ Group, LLC. ABBYY FineReader and PDF Transformer are copyrighted by ABBYY Software House. Acrobat, Acrobat Reader, Dreamweaver, FrameMaker, HomeSite, InDesign, Illustrator, PageMaker, Photoshop, and RoboHelp are registered trademarks of Adobe Systems Inc. Acrocheck is copyrighted by acrolinx GmbH. Acronis True Image is a trademark of Acronis, Inc. Across is a trademark of Nero AG. AllChars is copyrighted by Jeroen Laarhoven. ApSIC Xbench and Comparator are copyrighted by ApSIC S.L. Araxis Merge is copyrighted by Araxis Ltd. ASAP Utilities is copyrighted by eGate Internet Solutions. Authoring Memory Tool is copyrighted by Sajan. Belarc Advisor is a trademark of Belarc, Inc. Catalyst and Publisher are trademarks of Alchemy Software Development Ltd. ClipMate is a trademark of Thornsoft Development. ColourProof, ColourTagger, and QA Solution are copyrighted by Yamagata Europe. Complete Word Count is copyrighted by Shauna Kelly. CopyFlow is a trademark of North Atlantic Publishing Systems, Inc. CrossCheck is copyrighted by Global Databases, Ltd. Déjà Vu is a trademark of ATRIL Language Engineering, S.L. Docucom PDF Driver is copyrighted by Zeon Corporation. dtSearch is a trademark of dtSearch Corp. EasyCleaner is a trademark of ToniArts. ExamDiff Pro is a trademark of Prestosoft. EmEditor is copyrighted by Emura Software inc. Error Spy is copyrighted by D.O.G. GmbH. FileHippo is copyrighted by FileHippo.com.
    [Show full text]
  • Roberto Ochoa Crespo Work Experience Video Game Translation & Localization
    Video game translation & localization Roberto Ochoa Crespo Freelance translator specializing in video game localization (EN/DE>ES-es), proofreading and audiovisual translation. June 7th, 1983 Plaza del Peñón 9 9.º C, izq. 28923 / Alcorcón Work experience (Madrid, Spain) 2011- Freelance translator (EN/DE>ES-es) +34 650 72 90 97 [email protected] Role: Video game localization, translation, proofreading and QA testing of in-game texts for clients like Bigpoint, unlocked International, MoGi Group and Kabam. Translation and proofreading of marketing and legal texts. Translation and subtitling of video podcasts. Translation of product descriptions for Amazon Spanish site. ’s Translation of legal and medical texts. Official translation into Spanish of Eoin Ryan comic Space Avalanche. ’s web Sworn translator for the English language, approved by the Spanish Ministry of Foreign Affairs and Cooperation. 2009-2010 Linguaserve Internacionalización de Servicios Post: Terminology Unit Role: Creation and maintenance of multilingual translation glossaries. Writing style guides. 2003-2008 Wildfire Games Post: Linguistic Department for The Last Alliance Role: Creation and translation of terminological lists. Linguistic research. Education 2005-2009 Translation and Interpreting Universidad Autónoma de Madrid Four-year licenciatura degree (second language: English; third language: German). 2001-2005 Advertising and Public Relations Universidad Complutense de Madrid Five-year licenciatura degree. Complementary training 2011 Audiovisual Translation course: TV and film scripts Trágora Formación Translation of scripts and subtitling. 2007 Literary Translation course Leen, luego traducimos They read, therefore we translate ACEtt (Translators of the Spanish Collegial Association of Writers). Other information Languages Spanish: native. English: near-native (TOEFL Internet-Based Test: 112). German: professional working proficiency.
    [Show full text]
  • Table of Contents
    User Guide Copyright © Wordfast, LLC 2019. All rights reserved. Table of Contents Release Notes Summary........................................................................................................................................ 7 New Features....................................................................................................................................................7 Improvements....................................................................................................................................................7 Fixed Issues...................................................................................................................................................... 7 1 About this Guide................................................................................................................................................ 9 Conventions.......................................................................................................................................................9 Typographical............................................................................................................................................ 9 Icons.......................................................................................................................................................... 9 2 About Wordfast Pro......................................................................................................................................... 10 3 Get Started.......................................................................................................................................................
    [Show full text]
  • User Guide for Project Managers and Translators
    User Guide for Project Managers and Translators Copyright © Wordfast, LLC 2014. All rights reserved. TABLE OF CONTENTS About Wordfast Pro 3 help ................................................................................................. 7 Purpose ................................................................................................................................................ 7 Audience .............................................................................................................................................. 7 Organization ......................................................................................................................................... 7 Conventions ......................................................................................................................................... 8 Abbreviations and Acronyms ............................................................................................................... 9 About Wordfast Pro............................................................................................................11 Overview ........................................................................................................................................ 11 Key advantages .............................................................................................................................. 11 Project Manager plug-in workflow .................................................................................................. 11 TXML
    [Show full text]
  • Introducing a Cat Tool to Translate: Wordfast
    Indonesian Journal of English Language Studies Vol. 2, Number 1, February 2016 Introducing a Cat Tool to Translate: Wordfast Fika Apriliana, Ardiyarso Kurniawan, Sandy Ferianda, and Fidelis Chosa Kastuhandani ABSTRACT This article aims at introducing CAT tools to those prospective translators who are familiar with with the tools for the first time. Some of the CAT tools must be paid for while some others are free. This article is to inform the readers about the list of free and paid CAT tools, advantages and disadvantages of those tools. One does not need special training for using a free CAT tool while using the paid CAT tools, one needs some special preparation. This article is going to focus more on Wordfast Pro as the second most widely used CAT tools after SDLTrados. Wordfast Pro is a paid software the functioning of which is based on the creation of a Translation Memory which facilitates and speeds up the translator's work. This article is going to briefly explain the advantages of Wordfast Pro and the steps of using it. The translation example is presented to reveal the different translation results of Wordfast Pro as a paid CAT tool and OmegaT as a free CAT tool. Therefore, the article will facilitate those who intend to know more about Wordfast Pro and start using it. Keywords: CAT tools, Wordfast Pro, OmegaT INTRODUCTION When using CAT tools, it is Computerized translation has attracted immediately clear which parts of the text the attention of a large number of people must be translated; the unchanging who work directly or indirectly on portions are transferred accurately and translational issues such as professional directly; the time savings due to repeating translators, teachers, linguists, researchers expressions is huge; and expressions are and future translators.
    [Show full text]
  • Hieroglifs Profile.Pdf
    Company profile Hieroglifs Translation has a decade of experience in the fields of medicine, finances, technologies, law, telecommunications, politics, EU, marketing, and tourism as well as other areas. Thanks to our offices in Latvia, Romania and Italy, and reliable cooperation partners all over the world, we are able to offer a very wide range of language combinations. We are proud to say that we translate into more than 150 languages in various combinations. We are collaborating with over 2,000 translators who translate only into their native languages. We have great experience in the Baltic States, Scandinavia and Eastern Europe, thus we are especially well- equipped for Slavic, Baltic and Scandinavian languages. Our translators are specialists (engineers, IT specialists, lawyers, economists, accountants, marketing specialists, linguists and representatives of many other professions. Our services are already chosen by such well-known companies as Continental, Lavazza, Kepro, Casinetto, Tegola, Mutti, Collini Lavori SPA, to name a few. Services We provide a wide range of services such as: •Written translation in 150 languages with proofreading and editing; •Interpreting services (consecutive and simultaneous translation) for business meetings or seminars; •Desktop publishing services when printing booklets, manuals or books; •Transcription services when seminars or other information recorded in audio or videofiles must be converted into printed form; •Localization services – when new product is to be implemented in a new market; •Voiceover services for commercials, cartoons, voicemails etc.; •Multilingual research services – when you are planning to expand your business to other countries, we can help to do researches regarding particular case and many more; •Website design services to keep you website updated and with all the language services that are needed for your target audience; •Graphic design such as flyers, contact cards, ads, brochure and many more; •Braille translation services can cover a wide range of needs.
    [Show full text]
  • JEZIK: a Cognitive Translation System Employing a Single, Visible Spectrum Tracking Detector
    Brigham Young University BYU ScholarsArchive Student Works 2016-06-01 JEZIK: A Cognitive Translation System Employing a Single, Visible Spectrum Tracking Detector Davor Bzik Brigham Young University - Provo, [email protected] Follow this and additional works at: https://scholarsarchive.byu.edu/studentpub Part of the Electrical and Computer Engineering Commons BYU ScholarsArchive Citation Bzik, Davor, "JEZIK: A Cognitive Translation System Employing a Single, Visible Spectrum Tracking Detector" (2016). Student Works. 175. https://scholarsarchive.byu.edu/studentpub/175 This Peer-Reviewed Article is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Student Works by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected], [email protected]. JEZIK: A Cognitive Translation System Employing a Single, Visible Spectrum Tracking Detector Davor Bzik A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science James K. Archibald, Chair D. J. Lee Doran Wilde Department of Electrical and Computer Engineering Brigham Young University June 2016 Copyright © 2016 Davor Bzik All Rights Reserved ABSTRACT JEZIK: A Cognitive Translation System Employing a Single, Visible Spectrum Tracking Detector Davor Bzik Department of Electrical and Computer Engineering, BYU Master of Science A link between eye movement mechanics and the mental processing associated with text reading has been established in the past. The pausing of an eye gaze on a specific word within a sentence reflects correctness or fluency of a translated text. A cognitive translation system has been built employing a single, inexpensive web camera without the use of infrared illumination.
    [Show full text]
  • Unlimited TM and Glossary Access — Accesses an Unlimited Number of Tms and Glossaries Simultaneously and Prioritizes As Primary Or Secondary
    User Guide Copyright © Wordfast, LLC 2019. All rights reserved. Table of Contents Release Notes Summary........................................................................................................................................ 7 Improvements....................................................................................................................................................7 Fixed Issues...................................................................................................................................................... 7 Known Issues....................................................................................................................................................8 1 About this Guide................................................................................................................................................ 9 Conventions.......................................................................................................................................................9 Typographical............................................................................................................................................ 9 Icons.......................................................................................................................................................... 9 2 About Wordfast Pro......................................................................................................................................... 10 3 Get Started.......................................................................................................................................................
    [Show full text]
  • Introducing SDL Trados to Beginning Translators
    Indonesian Journal of English Language Studies Vol. 2, Number 1, February 2016 Introducing SDL Trados to Beginning Translators Lemmuela Alvita Kurniawati, Dian Titi Rahajeng, Barlian Kristanto, and Fidelis Chosa Kastuhandani ABSTRACT Over years, translators have been incorporating new advances in technology into their work. A number of attempts in developing ideal translator’s workstations using technology have been made, one of such stations is a Computer-Aided Translation (CAT) tool. CAT tools facilitate translators to increase their productivity and efficiency by providing them with some utilities, such as a Translation Memory and an Alignment Tool to support their translation works. This article highlights the usefulness of SDL Trados 2014, as one of the most widely used paid CAT tools, in translating the texts more effectively and in a timely manner. Further, it describes the basic steps in using SDL Trados and exemplifies SDL Trados 2014 translation results. A sample text was translated from Indonesian to English using both Across (a free CAT tool) and SDL Trados (a paid CAT tool). Eventually, the results of translating using both CAT tools appear to prove that SDL Trados helps the translators to translate more consistently, accurately, effectively and in a timely-manner. Keywords: CAT tools, SDL Trados, Across. INTRODUCTION used by translators to automate part of the The technological development affects translation process. CAT tools are helpful the new trends in the field of translation. to translators because they can speed up Technology plays the vital role for the translation process through the help of translators nowadays. Translators no translation memory when working with longer work in isolation relying on a pencil very repetitive texts.
    [Show full text]
  • User Guide for Project Managers and Translators
    User Guide for Project Managers and Translators Copyright © Wordfast, LLC 2018. All rights reserved. i Table of Contents Table of Contents About Wordfast Pro 3 Guide ......................................................................................................................... 9 Purpose ..................................................................................................................................................... 9 Audience .................................................................................................................................................... 9 Organization .............................................................................................................................................. 9 Conventions ............................................................................................................................................... 9 Abbreviations and Acronyms ................................................................................................................... 10 About Wordfast Pro ..................................................................................................................................... 11 Key advantages ....................................................................................................................................... 11 Project Manager plug-in workflow ........................................................................................................... 11 TXML editor workflow .............................................................................................................................
    [Show full text]
  • Virtaal Documentation Release 0.7.1
    Virtaal Documentation Release 0.7.1 Translate Jun 26, 2020 Contents 1 Using Virtaal 3 2 Virtaal Features 7 3 Virtaal Screenshots 17 4 Shortcuts 21 5 Tips and Tricks 23 6 Installation 27 7 Contact 29 8 Contributing 31 Index 39 i ii Virtaal Documentation, Release 0.7.1 Virtaal is a graphical translation tool. It is meant to be easy to use and powerful at the same time. Although the initial focus is on software translation (localisation or l10n), we definitely intend it to be useful for several purposes. Virtaal is built on the powerful API of the Translate Toolkit. “Virtaal” is an Afrikaans play on words meaning “For Language”, but also refers to translation. Read more about the features in Virtaal, or view the screenshots. You can also download a screencast (33MB, Ogg Theora format) to see some of these features in action. Learn more about using Virtaal, available shortcuts and some extra tips and tricks for people who want to customise their installation. Contents 1 Virtaal Documentation, Release 0.7.1 2 Contents CHAPTER 1 Using Virtaal Virtaal is meant to be powerful yet simple to use. You can increase your productivity by ensuring you know all the shortcuts and tricks and without being distracted by a cluttered interface. Although most features are available using the mouse, Virtaal is designed to encourage you to work as much as possible with your keyboard to increase your speed and keep the translation fun. When you have no file open in Virtaal, you’ll see the Virtaal dashboard with helpful links to your recent files and various common tasks in the program.
    [Show full text]
  • Anapraseus: Translators' Workhorse
    Tribuna <http://tremedica.org/panacea.html> Anapraseus: translators’ workhorse Oleg Tsygany* Abstract: Article about a new open source Computer Aided Translation tool for the creation, use and management of bi- lingual translation memories in any language, designed as an extension of OpenOffice.org. Features: text segmentation, fuzzy search in the translation memory, terminology recognition, import/export in TMX format, user interface similar to Wordfast’s. Key words: computer-assisted translation, CAT, translation memory, terminology recognition, open source, OpenOffice.org. Una herramienta excelente para el traductor Resumen: Este artículo trata de una herramienta de traducción asistida por ordenador, de código abierto, que se ha diseñado como una extension para OpenOffice.org. Se trata de una herramienta para crear, manejar y utilizar memorias de traduc- ción bilingües en cualquier combinación lingüística. Sus principales características son: segmentación del texto, búsqueda aproximada en la memoria de traducción, reconocimiento de la terminología, importación y exportación en el formato TMX y una interfaz de usuario similar a la de Wordfast. Palabras clave: traducción asistida por ordenador, TAO, memoria de traducción, reconocimiento de terminología, código abierto, OpenOffice.org. Panace@ 2009; 10 (29): 58-60 Modern world is not imaginable without international tools take a special place. CAT stands for “Computer Aided communication. This is one of the foremost challenges of our Translation” and means translation done with specialized times. News, information and culture are increasingly shared software with the capability to increase speed of work while across national boundaries. Personal communication has taken providing quality assurance, terminology management, etc. on new meanings in the age of the Internet.
    [Show full text]