OPUS-CAT: Desktop NMT with CAT Integration and Local Fine-Tuning

EACL 2021 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Tommi Nieminen
University of Helsinki, Yliopistonkatu 3, 00014 University of Helsinki, Finland
[email protected]

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 288–294, April 19 - 23, 2021. ©2021 Association for Computational Linguistics

Abstract

OPUS-CAT is a collection of software which enables translators to use neural machine translation in computer-assisted translation tools without exposing themselves to security and confidentiality risks inherent in online machine translation. OPUS-CAT uses the public OPUS-MT machine translation models, which are available for over a thousand language pairs. The generic OPUS-MT models can be fine-tuned with OPUS-CAT on the desktop using data for a specific client or domain.

1 Introduction

Neural machine translation (NMT) has brought about a dramatic increase in the quality of machine translation in the past five years. The results of the latest European Language Industry Survey (FIT Europe et al., 2020) confirm that NMT is now routinely used in professional translation work. NMT systems used in translation work are developed by specialized machine translation vendors, translation agencies, and organizations that have their own translation departments. Translators use NMT either at the request of a client, in which case the client provides the NMT, or independently, in which case they usually rely on web-based services offered by large tech companies (such as Google or Microsoft) or specialized machine translation vendors. These web-based services are mainly used through machine translation plugins or integrations that are available for all major computer-assisted translation (CAT) tools, such as SDL Trados and memoQ.

Even though MT has been extensively used in the translation industry for over a decade (Doherty et al., 2013), there is still considerable scope for growth: according to FIT Europe et al. (2020), 78 percent of language service companies plan to increase or start MT use, and most independent translation professionals use MT only occasionally. One of the factors slowing down the adoption of MT is the risk related to confidentiality and security. There are well-known risks involved with using web services, which also concern the web-based NMT services available to translators and organizations: data sent to the service may be intercepted en route, or it may be misused or handled carelessly by the service provider. These security and confidentiality risks (even if they are unlikely to actualize) hinder MT use by independent translation professionals, since their clients often specifically forbid or restrict the use of web-based MT (European Commission, 2019). Even if using web-based MT is not expressly forbidden, translators may consider it unethical or they may fear it might expose them to unexpected legal liabilities (Kamocki et al., 2016).

Producing MT directly on the translator’s computer without any communication with external services eliminates the confidentiality and security risks associated with web-based MT. This requires an optimized NMT framework which is capable of running on Windows computers (as most CAT tools are only available for Windows), and pre-trained NMT models for all required language pairs. The Marian NMT framework (Junczys-Dowmunt et al., 2018) fulfills the first requirement, as it is highly optimized and supports Windows builds. Pre-trained NMT models are available from the OPUS-MT project (Tiedemann and Thottingal, 2020), which trains and publishes Marian-compatible NMT models with the data collected in the OPUS corpus (Tiedemann, 2012). OPUS-CAT is a software collection which contains a local MT engine for Windows computers built around the Marian framework and OPUS-MT models, and a selection of plugins for CAT tools. OPUS-CAT is aimed at professional translators, which is why it also supports the fine-tuning of the base OPUS-MT models with project-specific data.

Figure 1: Diagram of the software and models used in OPUS-CAT. (Diagram labels: OPUS corpora; train models; OPUS-MT model repository; install models locally; OPUS-CAT MT Engine; locally installed Marian NMT; translate with models; finetune models; TMX file or parallel text files; extract fine-tuning material; API; CAT tool integration; Trados plugin; memoQ plugin; OmegaT plugin; Wordfast (as custom MT).)

2 OPUS-CAT MT Engine

The main component of OPUS-CAT is the OPUS-CAT MT Engine, a locally installed Windows application with a graphical user interface. OPUS-CAT MT Engine can be used to download NMT models from the OPUS-MT model repository, which contains models for over a thousand language pairs.

Figure 2: Install OPUS-MT models locally (1,000+ language pairs available)

Once a model has been downloaded, OPUS-CAT MT Engine can use it to generate translations by invoking a Marian executable included in the installation. Before the text is sent to the Marian executable, OPUS-CAT MT Engine automatically pre-processes the text using the same method that was originally used for pre-processing the training corpus of the model. Pre-processing is model-specific, as the older OPUS-MT models use Subword NMT (Sennrich et al., 2016) for segmenting the text while newer models use SentencePiece (Kudo and Richardson, 2018). Both Subword NMT and SentencePiece are coded in Python, but they are distributed as standalone Windows executables with OPUS-CAT MT Engine, as requiring the users to install Python in Windows would complicate the setup process.

The OPUS-CAT MT Engine user interface provides simple functionality for translating text, but the translations are mainly intended to be generated via an API that the OPUS-CAT MT Engine exposes. This API can be used via two protocols: net.tcp and HTTP. net.tcp is used with plugins for the SDL Trados and memoQ CAT tools, while HTTP is used for other plugins and integration. The motivation for using net.tcp is that exposing a net.tcp service on the local Windows computer does not require administrator privileges, which makes setting up the OPUS-CAT MT Engine much easier for non-technical users. However, Trados and memoQ are the only CAT tools with sufficiently sophisticated plugin development kits to allow for net.tcp connections, so the API can also be used via HTTP, with some extra configuration steps, to make it available to other tools. The API has three main functionalities:

• Translate: Generates a translation for a source sentence (or retrieves it from a cache) and returns it as a reply to the request.

• PreorderBatch: Adds a batch of source sentences to the translation queue and immediately returns a confirmation without waiting for the translations to be generated.

• Customize: Initiates model customization using the fine-tuning material included in the request.

The OPUS-CAT MT Engine stores the local NMT models in the user’s application data folder in order to avoid file permission issues. The local application data folder is used, as saving the models in the roaming application data folder could lead to unwanted copying of the models if the same user profile is used on multiple computers. The user interface of the OPUS-CAT MT Engine contains functionalities for managing models installed on the computer, such as deletion of models, packaging of models for migration to other systems, and tagging the models with descriptive tags. The tags can be used to select specific models in CAT tool plugins, e.g. a model fine-tuned for a specific customer can be tagged with the name of the customer.

3 CAT tool plugins and integration

OPUS-CAT contains plugins for three CAT tools: SDL Trados, memoQ and OmegaT.

Figure 3: Trados plugin settings. Note the preordering function and model tag.

[...] in the plugin settings) to the OPUS-CAT for translation. This means that when the translator moves to the next segment, it has already been translated and can simply be retrieved from the OPUS-CAT translation cache.

4 Local fine-tuning of models
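Figure 1 names TMX files and parallel text files as the sources from which fine-tuning material is extracted. The following is only a minimal sketch of what such an extraction step could look like for a TMX file; it is not the OPUS-CAT implementation, which this text does not describe. The function name extract_pairs and the sample data are hypothetical, and the sketch assumes that language codes match exactly after lowercasing (real TMX files often carry regional codes such as en-US).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        src = segs.get(src_lang.lower())
        tgt = segs.get(tgt_lang.lower())
        # Keep only units that have non-empty text on both sides.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Made-up sample: the second unit has no Finnish side and is skipped.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="fi"><seg>Avaa tiedosto.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save changes?</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = extract_pairs(SAMPLE_TMX, "en", "fi")
```

Pairs collected this way could then be written out as two aligned text files, which is the usual input shape for NMT training. How the material is actually packaged for the Customize request is not specified here.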