The FISKMÖ Project: Resources and Tools for Finnish-Swedish Machine
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3808–3815 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC The FISKMO¨ Project: Resources and Tools for Finnish-Swedish Machine Translation and Cross-Linguistic Research Jorg¨ Tiedemann∗, Tommi Nieminen, Mikko Aulamo∗, Jenna Kanervay, Akseli Leinoy, Filip Gintery, Niko Papula ∗Department of Digital Humanities, yDepartment of Future Technologies, Multilizer University of Helsinki, University of Turku, Espoo fjorg.tiedemann, mikko.aulamog@helsinki.fi, fjmnybl,akeele,figintg@utu.fi, [email protected] Abstract This paper presents FISKMO,¨ a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages for the general purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services making it possible to work with highly sensitive data without compromising security concerns. Keywords: parallel corpora, machine translation 1. Introduction and for the optimisation of domain-specific translation en- gines. Finland is a country with two official languages, Finnish This paper reports the achievements of the project includ- and Swedish. The demand for translation between the two ing the data sets we have collected so far, machine trans- languages is huge and translation efforts produce a lot of lation models we have developed and translation tools that overhead and costs in public administration and organisa- we have released. In particular, we describe tions as well as any service providers in the country. Yet there is no systematic collection of translated data nor any • data provided by governmental organisations and aca- large scale effort to develop translation services and tools demic institutions, for the public and administrative use in the country. For • methods for extracting parallel segments from web- this reason, we started FISKMO¨ 1 (finsk-svensk korpus och crawled data, maskinoversattning)¨ a joint project of the universities of Helsinki and Turku and Kites (an umbrella organization • tools for pre-processing and alignment of parallel data, for the language sector in Finland), currently funded by the • the implementation of a general-purpose on-line trans- Swedish Culture Foundation.2 lation service and The primary goal of the project is to create a massive par- allel corpus of Finnish and Swedish to support linguistic • the development of computer-assisted translation research and machine translation (MT) development. The (CAT) tools with fully integrated machine translation. secondary goal of the project is to establish a public ma- The latter is especially interesting for real-world applica- chine translation service for translation between Finnish tions in which translation efforts need to be done off-line and Swedish. As the quantity and quality of training data and with strict security constraints. We released a plugin is the most important factor in machine translation quality, to the popular SDL Trados Studio CAT tool that integrates the FISKMO¨ translation service can be expected to pro- the MT engine inside the plugin itself making it obsolete to vide translations of higher quality than other public ma- send (potentially sensitive) data to on-line services. More chine translation services, for which less training data is details about this tool will be provided in Section 4.2. available. Our mission is to systematically collect material that has 2. Data Collection been translated from Finnish to Swedish and vice versa in The primary goal of FISKMO¨ is the systematic collection the public sector, in private organisations and at language of Finnish-Swedish parallel data from public as well as pri- service providers in Finland. We intend to publish the data vate sources in order to improve the development of ma- with permissive licenses to increase the impact on on-going chine translation between the two languages. Furthermore, research and development. Nevertheless, we will also in- our project intends to support cross-linguistic research by clude restricted data sets that can be used internally for im- making those data sets available to researchers in transla- provements of the coverage of machine translation models tion studies and general linguistics. Computer-aided lan- guage learning is another area where translation data is be- 1https://blogs.helsinki.fi/fiskmo-project/ coming increasingly useful. Our efforts in collecting data 2https://www.kulturfonden.fi relate to two strategies: 3808 Figure 1: The basic workflow for data collection and machine translation development in the FISKMO¨ project. 1. Collecting data from public and private organisations tools in the workflow of professional translators. The ap- and language service providers. We also ask for gen- pearance of the ”EU Presidency Translator”3 developed in eral data donations from the public. collaboration with the Finnish Prime Minister’s Office and their translation unit shed some new light on the integration 2. Web crawling and parallel segment extraction. We de- of MT in human translation demonstrating the benefits of velop and apply methods for the identification of trans- such tools in the translator’s daily work. lated sentences and documents from unrestricted web The private sector has been an even more challenging seg- crawls. ment despite initial promises to donate data. Finnish lan- guage service providers (LSPs) are generally not using MT The workflow for data collection and its connection to ma- in large scale and thus they are not very experienced in that chine translation development is illustrated in Figure 1. Be- technology. It seems that there is a common believe that low, we give more details about the efforts and procedures the development of machine translation would benefit their that we apply. competitors more than themselves and, therefore, it is bet- 2.1. Collecting data from organizations ter to hold back with handing out data. Another issue is also that language service providers usually do not own the So far the majority of the data received through direct con- translations they have done for their clients. It will be nec- tributions has come from public organizations. Public orga- essary to change the way IPRs are regulated between LSPs nizations in Finland are obliged to give their data for public and their customers but permissions could easily be asked use if there are no good grounds to do otherwise. How- from the clients to improve services that they order from ever, collecting the data is not always straightforward as LSPs. In general , convincing LSPs to see the benefits of their publication can differ greatly. Mostly, organisations translation services that they can incorporate in their own refer to their websites to point to their public data sets. workflow requires the successful demonstration of practi- Unfortunately, mining parallel data from the web comes cal tools. Therefore, it is crucial to develop proper inte- with various caveats as website structures and functional- grations into CAT tools similar to the one we introduce in ities vary. Hence, our goal in FISKMO¨ is to obtain the Sections 4.2.. data directly from the original source in suitable electronic To help potential providers to donate their data, we have formats and with explicit correspondences between trans- created different procedures they can follow to support our lations and their original documents. Furthermore, we ex- efforts. pected translation memories to be in wide use internally to make it easy to incorporate collected translation knowledge Donating translation memories: In this method a trans- in our project. lation memory is donated as is. The translation mem- Unfortunately, the situation is more complicated than ex- ory can be an existing translation memory but some- pected for numerous reasons. In some cases, translation times organizations prefer to create a specific one for units work independently from the publishing unit, which this purpose to better control its contents. In return we makes it difficult to identify translation data that is pub- offer automatically cleaned translation memories and lished openly without further editing and modification. Ad- customized MT engines based on the data if there is ditionally, public data and sensitive data are often not sys- sufficient training material and resources available. tematically separated and translation memories incorporate mixed content. Furthermore, copyrights and intellectual Crawling and filtering with translation memories: property rights (IPRs) may not always be clear for all data This method effectively overcomes many privacy files in larger collections. Those reasons and the fear of concerns. We crawl e.g. organization websites and compromising privacy information make it difficult to con- thus create a monolingual corpus